Research Article  Open Access
Maitane Barrenechea, Mikel Mendicute, Egoitz Arruti, "Fully Pipelined Implementation of Tree-Search Algorithms for Vector Precoding", International Journal of Reconfigurable Computing, vol. 2013, Article ID 496013, 12 pages, 2013. https://doi.org/10.1155/2013/496013
Fully Pipelined Implementation of Tree-Search Algorithms for Vector Precoding
Abstract
The nonlinear vector precoding (VP) technique has been proven to achieve close-to-capacity performance in multiuser multiple-input multiple-output (MIMO) downlink channels. The performance benefit with respect to its linear counterparts stems from the incorporation of a perturbation signal that reduces the power of the precoded signal. The computation of this perturbation element, which is known to belong to the class of NP-hard problems, is the main aspect that hinders the hardware implementation of VP systems. In this respect, several tree-search algorithms have been proposed for the closest-point lattice search problem in VP systems. Nevertheless, the optimality of these algorithms has been assessed mainly in terms of error-rate performance and computational complexity, leaving the hardware cost of their implementation an open issue. The parallel data-processing capabilities of field-programmable gate arrays (FPGA) and the loopless nature of the proposed tree-search algorithms have enabled an efficient hardware implementation of a VP system that provides a very high data-processing throughput.
1. Introduction
Since the presentation of the vector precoding (VP) technique [1] for data transmission over the multiuser broadcast channel, many algorithms have been proposed in the literature to replace the computationally intractable exhaustive search defined in the original description of the algorithm. In this respect, lattice reduction approaches have been widely used as a means to compute a suboptimum perturbation vector with moderate complexity. The key idea of lattice-reduction techniques is to use an equivalent and more advantageous set of basis vectors, which allows for the suboptimal resolution of the exhaustive search problem by means of a simple rounding operation. This method is used in [2], where the Lenstra-Lenstra-Lovász (LLL) reduction algorithm [3] is used to yield Babai's approximate closest-point solution [4]. Similar approaches can be found in [5–7]. Despite achieving full diversity order in VP systems [8, 9], the performance degradation caused by the quantization error due to the rounding operation still remains. Moreover, many lattice reduction algorithms have a considerable computational complexity, which poses many challenges to a prospective hardware implementation.
An appropriate perturbation vector can also be found by searching for the optimum solution within a subset of candidate vectors. These approaches, also known as tree-search techniques, perform a traversal through a tree of hypotheses with the aim of finding a suitable perturbation vector.
In spite of the high volume of research work published on the topic of precoding algorithms, the issues raised by their implementation have not been given the same attention. Some of the few publications in this area, such as [10–12], describe precoding systems that either have a considerable complexity in terms of allocated hardware resources or provide a rather low data transmission rate.
Despite the lack of published research in the area of hardware architectures for precoding algorithms, the implementation issues of tree-search schemes in MIMO detection scenarios have been widely studied. For example, the field-programmable gate array (FPGA) implementation of the fixed-complexity sphere decoder (FSD) detector has been analyzed in [13–15], whereas the hardware architecture of the K-best tree search considering a real equivalent model was researched in [16–21]. Moreover, the implementation of a K-best detector with suboptimum complex-plane enumeration was performed in [22, 23]. A thorough review of K-best tree-search implementation techniques was carried out in [24]. The adaptation of these tree-search schemes to precoding systems implies many variations with respect to the original description of the algorithms. Even if many lessons can be learned from the hardware architecture of tree-search techniques for point-to-point MIMO systems, the peculiarities of the precoding scenario render the results of the aforementioned publications inadequate for the current research topic.
Consequently, this contribution addresses the high-throughput implementation of fixed-complexity tree-search algorithms for VP systems. More specifically, two state-of-the-art tree-search algorithms that allow for the parallel processing of the tree branches have been implemented on a Xilinx Virtex VI FPGA following a rapid-prototyping methodology. In order to achieve a high throughput, both schemes operate in the complex plane and have been implemented in a fully pipelined fashion providing one output per cycle.
This contribution is organized as follows: in Section 2 the system model is introduced, followed by a short review in Section 3 of the noniterative tree-search algorithms to be implemented. Next, the general hardware architecture of the data perturbation process is outlined in Section 4, whereas the specific features of the K-best and FSE modules are analyzed in Sections 5 and 6, respectively. An analysis of the tree-search parameters for both techniques is performed in Section 7 and the hardware implementation results are shown in Section 8. Finally, some concluding remarks are drawn in Section 9.
Notation. In the remainder of the paper the matrix transpose and conjugate transpose are represented by and , respectively. We use to represent the identity matrix and to denote the modulo operator. The set of Gaussian integers of dimension , namely, , is represented by .
2. System Model
Consider the multiuser broadcast channel with antennas at the transmitter and single-antenna users, denoted as where . We assume that the channel between the base station and the users is represented by a complex-valued matrix , whose element represents the channel gain between transmit antenna and user . For all simulations, the entries of the channel matrix are assumed independent and identically distributed with zero-mean circularly symmetric complex Gaussian distribution and .
The precoding system under study is shown in Figure 1. According to the aforementioned model, the data received at the user terminals can be collected in the vector , which is given by where represents the additive white Gaussian noise vector with covariance matrix .
In precoding systems such as the one depicted in Figure 1, the independent data acquisition at the receivers is enabled by a pre-equalization stage at the base station. This procedure, which is carried out by means of a precoding filter , anticipates the distortion caused by the channel matrix in such a way that the received signal is (ideally) fully equalized at the receive terminals. However, the precoding process causes variations in the power of the user data streams, and therefore, a power scaling factor is applied to the vector of precoded symbols prior to transmission to ensure a certain transmit power . At the user terminals, the received signal is scaled by again to allow for an appropriate detection of the data symbols. Hence, the signal prior to the detection stage reads as
From this equation, one can notice that in the event of , or equivalently , an increase in the power of the noise vector is experienced at the receivers, which greatly deteriorates the error-rate performance of the system. In this respect, nonlinear signal processing approaches aim at reducing the power of the linearly precoded symbols, in such a way that a considerable performance enhancement can be attained. In VP systems, this objective is achieved by incorporating a perturbation signal prior to the linear precoding stage.
The data perturbation process is supported by the modulo operator at the receivers, which provides the transmitter with additional degrees of freedom to choose the perturbation vector that is most suitable. Note that the perturbation signal must be composed of integer multiples of the modulo constant , namely, with , so that it can be easily removed at the receivers by means of a simple modulo operation.
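The role of the modulo operation can be illustrated with a minimal Python sketch. The function name `modulo_tau`, the value of the modulo constant, and the sample symbols are hypothetical choices made for this illustration only; the sketch assumes a symmetric modulo that folds the real and imaginary parts independently into an interval of width equal to the modulo constant.

```python
import math


def modulo_tau(z: complex, tau: float) -> complex:
    """Fold the real and imaginary parts of z into [-tau/2, tau/2)."""
    def fold(x: float) -> float:
        return x - tau * math.floor(x / tau + 0.5)
    return complex(fold(z.real), fold(z.imag))


tau = 4.0                     # hypothetical modulo constant
s = complex(1.0, -1.0)        # data symbol inside the fundamental region
a = complex(2, -1)            # Gaussian-integer perturbation element
perturbed = s + tau * a       # symbol after the perturbation is added
recovered = modulo_tau(perturbed, tau)
print(perturbed, recovered)
```

Because the perturbation is an integer multiple of the modulo constant, the folding step removes it regardless of which Gaussian integer the transmitter selected.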
The VP system model that achieves the best error-rate performance targets the minimization of the mean square error (MSE) instead of the traditional goal of reducing the average power of the transmitted symbols [25]. In this model, the precoding matrix is designed as , with , and the triangular matrix used for the computation of the perturbation signal is computed as . Finally, the optimum perturbation vector is obtained by evaluating the following cost function:
The computation of the perturbing signal in (3) entails a search for the closest point in a lattice. Several techniques to efficiently obtain the perturbation signal will be analyzed in the following section.
3. Tree-Search Techniques for Vector Precoding
The triangular structure of the matrix in (3) enables the gathering of all the solution vector hypotheses in an organized structure which resembles the shape of a tree. Following the analogy with the tree structure, the concatenation of lattice elements or nodes is referred to as a branch, where a branch of length represents a candidate solution vector. The search for the perturbation vector is then performed by traversing a tree of levels (each one representing a user) starting from the root level , and working backwards until .
Since the elements of the solution vector belong to the expanded search space , the number of nodes that originate from each parent node in the tree equals in theory. However, depending on the tree-traversal strategy to be followed, the cardinality of this set can be reduced either artificially, by limiting the search space to the group of closest points to the origin, or by identifying the set of eligible nodes following a distance control policy (also known as the sphere constraint). Note that, as opposed to the point-to-point MIMO detection scenario, the number of child nodes that stem from the same parent node does not depend on the modulation constellation in use. This way, the computation of the (squared) Euclidean distances in (3) can be distributed across multiple stages as follows: where
The partial Euclidean distance (PED) associated with a certain node at level is denoted as , while the accumulated Euclidean distance (AED) down to level is given by .
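The level-wise accumulation of distances can be checked numerically with a short Python sketch. It assumes an upper-triangular matrix `U`, a data vector `s`, a Gaussian-integer perturbation `a` and a modulo constant `tau`; these names, the example values and the exact form of the intermediate point `z` are assumptions of this illustration rather than the paper's notation.

```python
def direct_cost(U, s, a, tau):
    """Squared norm of U (s + tau * a), computed row by row."""
    n = len(s)
    x = [s[j] + tau * a[j] for j in range(n)]
    return sum(abs(sum(U[i][j] * x[j] for j in range(i, n))) ** 2 for i in range(n))


def levelwise_cost(U, s, a, tau):
    """Same cost, accumulated from the root level (last row) to the first row."""
    n = len(s)
    aed = 0.0
    for i in range(n - 1, -1, -1):
        # Intermediate point: data symbol plus the contribution of levels j > i.
        z = s[i] / tau + sum(U[i][j] / U[i][i] * (a[j] + s[j] / tau)
                             for j in range(i + 1, n))
        ped = (tau * abs(U[i][i])) ** 2 * abs(a[i] + z) ** 2  # distance increment
        aed += ped                                            # accumulated distance
    return aed


U = [[1.0, 0.5 + 0.2j, -0.3j],
     [0.0, 0.8, 0.1 - 0.4j],
     [0.0, 0.0, 1.2]]
s = [1 + 1j, -1 + 3j, 3 - 1j]
a = [0j, 1 - 1j, -1 + 0j]
tau = 8.0
print(direct_cost(U, s, a, tau), levelwise_cost(U, s, a, tau))
```

Both functions return the same value up to rounding error, which is what allows each tree level to add only its own increment to the distance inherited from the previous level.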
The traversal of the search tree is usually performed following either a depth-first or breadth-first strategy.
3.1. Depth-First Tree-Search Techniques
Depth-first tree-search techniques traverse the tree in both forward and backward directions enabling the implementation of a sphere constraint for pruning of unnecessary nodes based on, for example, the Euclidean distance associated with the first computed branch. The pruning criterion, which is updated every time a leaf node at level with a smaller AED is reached, does not impose a per-level run-time constraint, and therefore, the complexity of these algorithms is of variable nature.
One of the most noteworthy depth-first techniques is the SE algorithm [26, 27], which restricts the search for the perturbation vector to the set of nodes with that lie within a hypersphere of radius centered around a reference signal. The good performance of the algorithm is a consequence of the identification and management of the admissible set of nodes at each stage of the tree search. Every time a forward iteration is performed () the algorithm selects and computes the distance increments of the nodes that fulfil the sphere constraint and continues the tree search with the most favorable node according to the Schnorr-Euchner enumeration [28] (the node resulting in the smallest ). This process is repeated until a leaf node is reached (which will result in a radius update) or no nodes that satisfy the sphere constraint are found. In any case, the SE will proceed with a backward iteration () where a radius check will be performed among the previously computed set of candidate points. If a node with is found, the tree traversal is resumed with a forward iteration. The optimum solution has been found when the hypersphere with the updated radius contains no further nodes.
The radius reduction strategy along with the tracking of potentially valid nodes at each level of the algorithm prevents unnecessary distance computations but ultimately results in a rather complex tree-search hardware architecture.
3.2. Breadth-First Tree-Search Techniques
Breadth-first tree-search algorithms with upper-bounded complexity traverse the tree in the forward direction only, identifying a set of potentially promising nodes and expanding only these in the subsequent tree levels. These algorithms benefit from a fixed and high data-processing throughput that stems from the parallel processing of the branches. Nevertheless, the speculative pruning carried out during the tree search prevents bounded breadth-first algorithms from achieving an optimum performance.
As one can guess from its name, the K-best precoder [29, 30] selects the K best branches at each level of the tree regardless of the sphere constraint or any other distance control policy. At each stage of the K-best tree search, an ordering procedure has to be performed on the eligible candidate branches based on their AEDs down to level . After the sorting procedure, the K paths with the minimum accumulated distances are passed on to the next level of the tree. Once the final stage of the tree has been reached, the branch with the minimum Euclidean distance is selected as the K-best solution. Clearly, the main bottleneck in this scheme stems from the node ordering and selection procedures performed at every level of the tree search.
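The selection rule of the K-best search can be sketched in Python as follows. This is an illustrative software model, not the hardware architecture discussed later: the matrix `U`, the vector `s`, the constant `tau`, the helper names and the brute-force child window of (2*width+1)^2 lattice points are all assumptions introduced for the example, and the explicit `sort` stands in for the sorting stage that the text identifies as the bottleneck.

```python
def direct_cost(U, s, a, tau):
    """Squared norm of U (s + tau * a)."""
    n = len(s)
    x = [s[j] + tau * a[j] for j in range(n)]
    return sum(abs(sum(U[i][j] * x[j] for j in range(i, n))) ** 2 for i in range(n))


def children(z, width):
    """Gaussian integers in a square window centred on the rounding of -z."""
    cr, ci = round(-z.real), round(-z.imag)
    return [complex(cr + dr, ci + di)
            for dr in range(-width, width + 1)
            for di in range(-width, width + 1)]


def k_best(U, s, tau, K, width=2):
    """Keep the K branches with the smallest accumulated distance at each level."""
    n = len(s)
    branches = [((), 0.0)]  # (perturbation elements of levels i+1..n-1, AED)
    for i in range(n - 1, -1, -1):
        expanded = []
        for partial, aed in branches:
            z = s[i] / tau + sum(U[i][j] / U[i][i] * (partial[j - i - 1] + s[j] / tau)
                                 for j in range(i + 1, n))
            for a_i in children(z, width):
                ped = (tau * abs(U[i][i])) ** 2 * abs(a_i + z) ** 2
                expanded.append(((a_i,) + partial, aed + ped))
        expanded.sort(key=lambda branch: branch[1])  # sorting stage
        branches = expanded[:K]                      # survivors for the next level
    return branches[0]


U = [[1.0, 0.5 + 0.2j, -0.3j],
     [0.0, 0.8, 0.1 - 0.4j],
     [0.0, 0.0, 1.2]]
s = [1 + 1j, -1 + 3j, 3 - 1j]
tau = 8.0
best_vector, best_aed = k_best(U, s, tau, K=4)
print(best_vector, best_aed)
```

Setting `K` to the total number of branches disables pruning, which turns the sketch into an exhaustive search over the windowed tree and is a convenient reference when experimenting with smaller values of `K`.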
The fixed-complexity sphere encoder (FSE) was presented in [31] as a sort-free alternative to the aforementioned K-best precoder. The proposed scheme avoids the intricate sorting stages required by the K-best by defining a node selection procedure based on a tree configuration vector . This vector specifies the number of child nodes to be evaluated at each level () following the Schnorr-Euchner enumeration. Therefore, only PEDs are computed per parent node at each level, yielding a total candidate branch count of .
Both fixed-complexity algorithms achieve a high data-processing throughput due to their capability of parallel branch computation. This high-speed data-processing feature will be assessed by carrying out the hardware implementation of a vector precoder based on the fixed-complexity algorithms under study. The main objectives of this study are twofold: on one hand, the quantification of the data-transmission throughput of the proposed architectures, and on the other hand, the assessment of the hardware resource allocation required for their implementation.
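A minimal Python sketch of the sort-free FSE traversal is given below. The list `config` plays the role of the tree configuration vector, with `config[i]` children expanded per parent at row `i` of an assumed upper-triangular matrix `U`; the helper names, the example values and the brute-force 5-by-5 enumeration window (adequate only for small child counts) are assumptions of this illustration.

```python
def closest_gaussian_integers(target, count):
    """The `count` Gaussian integers nearest to `target`, closest first."""
    cr, ci = round(target.real), round(target.imag)
    window = [complex(cr + dr, ci + di)
              for dr in range(-2, 3) for di in range(-2, 3)]
    window.sort(key=lambda g: abs(g - target))
    return window[:count]


def fse(U, s, tau, config):
    """Expand config[i] children per parent at level i; no sorting across parents."""
    n = len(s)
    branches = [((), 0.0)]
    for i in range(n - 1, -1, -1):
        expanded = []
        for partial, aed in branches:
            z = s[i] / tau + sum(U[i][j] / U[i][i] * (partial[j - i - 1] + s[j] / tau)
                                 for j in range(i + 1, n))
            for a_i in closest_gaussian_integers(-z, config[i]):
                ped = (tau * abs(U[i][i])) ** 2 * abs(a_i + z) ** 2
                expanded.append(((a_i,) + partial, aed + ped))
        branches = expanded  # every expanded branch survives
    best = min(branches, key=lambda branch: branch[1])
    return best, len(branches)


U = [[1.0, 0.5 + 0.2j, -0.3j],
     [0.0, 0.8, 0.1 - 0.4j],
     [0.0, 0.0, 1.2]]
s = [1 + 1j, -1 + 3j, 3 - 1j]
tau = 8.0
(best_vector, best_aed), branch_count = fse(U, s, tau, config=[1, 1, 4])
print(best_vector, best_aed, branch_count)
```

The number of branches that reach the last level equals the product of the entries of `config`, so the amount of work is fixed before the search starts and independent of the data.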
4. General Architecture Overview
Both tree-search schemes share the same general distance computation structure, as can be seen in Figure 2. The lack of loops in the hardware architecture of the fixed-complexity tree-search techniques enables a high-throughput and fully pipelined implementation of the data perturbation process, which makes it especially suitable for a target FPGA device.
The AEDs of the candidate branches are computed by accumulating the PEDs calculated at the local distance processing units (DPUs) to the AEDs of the previous level. This way, the AEDs down to level corresponding to the considered candidate branches, namely, , are passed on from DPU to DPU . The parameter stands for the number of candidate branches at each level of the tree search, with its value being for the K-best and for the FSE model.
Two input memory blocks, named Data Memory and Channel Memory, have been included to store the data symbols and the values of the triangular matrix , respectively. The off-diagonal matrix coefficients are stored as , whereas the diagonal values are in the form of to simplify the calculation of (4) and (8). Note that the matrix preprocessing stage required by the FSE and K-best approaches has not been included in the hardware design. The computation of the intermediate points requires the values of all previous . To avoid redundant calculations, the set of values for all is transferred to DPU , as is shown in Figure 2.
The hardware structure of the first DPU is common to both schemes. The computation of the Euclidean distances in this level does not involve any data from previous levels, and therefore, the only operation to be performed is to select the lattice values closest to and to compute the corresponding PEDs. Given that the position of the modulation constellation within the complex lattice is known beforehand, and considering the symmetries of the complex lattice, it is possible to select the nodes to be passed on to the next level without performing any extra distance calculations and sorting procedures. Additionally, the hardware structure of the last DPU is also the same for both algorithms, as only the most favorable child node that stems from each one of the parent nodes needs to be expanded at this level. Such a task can be performed by simply rounding the value of to the position of the nearest lattice point.
The main and crucial differences between the FSE and K-best tree-search algorithms lie in the DPUs of levels .
5. DPU for the K-Best
The difficulty of performing the sorting procedure in the complex plane, where the number of nodes to be considered is higher, and the intricacy of complex-plane enumeration have led to the dominance of real-valued decomposition (RVD) as the preferred technique when implementing the K-best tree search. Nevertheless, direct operation on the complex signals is preferred from an implementation point of view as the length of the search tree is halved, and hence, the latency and critical path of the design can be shortened.
5.1. Structure of the Sorting Stage
Regardless of the domain of the signals to be used, the bottleneck in this type of system is usually the sorting stage performed at each tree level. The number of child nodes that stem from the same parent node will be defined as , with its value being for the complex-plane model, whereas will be required for the RVD scheme. The PED calculation and subsequent sorting procedure on the child nodes at each level is a computationally expensive process that compromises the throughput of the whole system. With the aim of alleviating the burden of the sorting stage, the use of the Schnorr-Euchner ordered sequence of child nodes and the subsequent merging of the sorted sublists is proposed in [17]. Even if the proposed scheme is implemented on an RVD model due to the simplicity of the local enumeration, it is possible to extend it to the complex plane if a low-complexity enumerator, such as the puzzle enumerator presented in [32], is utilized. Additionally, a fully pipelined RVD architecture of the sorted sublists algorithm is proposed in [33] for high-throughput systems. By dividing the real axis into regions and storing the corresponding enumeration sequences in look-up tables (LUTs), the algorithm is able to determine the child node order by means of a simple slicing procedure. Nevertheless, this technique is advantageous only when operating with RVD symbols as the amount of data to be stored and the quantity of nonoverlapping regions grow remarkably when complex-valued symbols are utilized. In any case, the use of any of the aforementioned sublist merging approaches reduces the number of PED computations to be performed at each level to .
5.1.1. The Winner Path Extension Algorithm
The number of costly distance computations can be further reduced by implementing the winner path extension (WPE) selection approach presented in [34] and incorporated into the RVD hardware architecture of the K-best tree search in [35, 36]. The proposed scheme selects the most favorable branches in iterations by performing just PED computations. An illustrative example of the WPE algorithm is depicted in Figure 3 for a system with and . The child nodes at a certain tree level are tagged as , where represents the index of the parent node and denotes the position of that certain node within the ordered Schnorr-Euchner sequence of child nodes.
The WPE sorting procedure is based on the generation and management of a node candidate list . This way, the child node corresponding to the th most favorable branch is extracted from the candidate list of the th sorting stage . The initial values in the candidate list consist of the AEDs down to level of the best child nodes that stem from each one of the parent nodes, which gives . The winner branch in the initial sorting stage, or equivalently the first of the most favorable branches, is selected as the branch with the smallest AED within . The PED of the second most favorable child node that stems from the same parent node as the latest appointed winner branch is computed ( in the example illustrated in Figure 3) and the AED of the resulting tree branch is added to the candidate list . The algorithm proceeds accordingly until the required branches have been identified.
Additionally, note that, according to the complexity analysis based on the number of comparisons in a WPE-enabled K-best tree search published in [35], the complexity of an RVD-based tree traversal doubles that of its complex-plane counterpart.
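The candidate-list mechanism can be modelled in a few lines of Python. In this sketch, `parent_aeds` holds the accumulated distances of the surviving parents and `child_peds[p]` lists the distance increments of parent `p`'s children in ascending (Schnorr-Euchner) order; both names and the sample numbers are hypothetical, and the increments are given as precomputed lists only to keep the example short, with a counter tracking how many of them the procedure actually reads.

```python
def wpe_select(parent_aeds, child_peds, K):
    """Pick the K smallest-AED branches by extending only the latest winner's parent."""
    computed = 0                              # PED look-ups actually needed
    next_child = [0] * len(parent_aeds)       # children already expanded per parent
    candidates = {}                           # parent index -> AED of its pending child
    for p, aed in enumerate(parent_aeds):
        candidates[p] = aed + child_peds[p][0]   # best child of every parent
        computed += 1
    winners = []
    for stage in range(K):
        p = min(candidates, key=candidates.get)  # minimum search over the list
        winners.append((p, next_child[p], candidates.pop(p)))
        next_child[p] += 1
        if stage < K - 1 and next_child[p] < len(child_peds[p]):
            # Replace the winner by the next child of the same parent.
            candidates[p] = parent_aeds[p] + child_peds[p][next_child[p]]
            computed += 1
    return winners, computed


parent_aeds = [1.0, 2.0, 4.0]
child_peds = [[0.5, 3.0, 6.0],
              [0.1, 0.2, 5.0],
              [0.3, 2.0, 9.0]]
winners, computed = wpe_select(parent_aeds, child_peds, K=3)
print(winners, computed)
```

In this model with three parents, the three winners are found after five increment evaluations instead of the nine required to expand every child. The correctness of the selection relies on each parent's children being visited in order of increasing distance.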
5.2. Structure of the K-best DPU
The structure of the proposed K-best DPU is depicted in Figure 4 for a system with . The branch selection procedure is carried out in fully pipelined sorting stages following a modified version of the WPE algorithm presented in [33, 34]. First of all, the computation of the intermediate points is performed for each one of the branches that are passed on from the previous level. The set of best child nodes that stem from each parent node can be computed by simply rounding off the value of the intermediate point to the nearest lattice point. The distance increments ( in (4)) for those best children are computed by metric computation unit (MCU) and are accumulated with their corresponding values. These distance values and their corresponding branches comprise the candidate list . The minimum AED within is found at the minimum-search unit (MSU) by simple concatenation of compare-and-select blocks. The MSU also outputs the index of the first winner branch so that the appropriate value of can be selected for the local enumeration procedure.
At the second stage of the sorting procedure, the node needs to be identified for any parent node index . This task is performed by the block, which comprises a puzzle enumerator that outputs the second most favorable node given a certain value of . However, in the subsequent stages of the algorithm, the enumeration procedure will depend on the index of the previously appointed winner branches. Hence, if , the third most promising child node will need to be expanded, namely, , whereas the second most favorable node in the branch () will be required if . Consequently, the new candidate branch to be included in the candidate list at the th sorting stage will require the expansion of the th most favorable child node, where may take any value within the set .
The enumeration approach at each sorting stage has been carried out by means of a puzzle enumerator unit capable of ascertaining the optimum ordered sequence of the first child nodes in a nonsequential fashion. The node order determination in the puzzle enumerator can be carried out without performing any costly distance computations. For each sorting stage any node in the ordered sequence of best child nodes can be selected for expansion. This way, the desired child node in the ordered sequence is determined by an additional input variable which keeps track of the number of already expanded child nodes for each parent node. The puzzle enumerator has been selected as the enumeration scheme to be used along with the WPE due to its lower hardware resource demand and nonsequential nature, as discussed in [32]. Note that there are no feedback loops in the structure of the K-best DPU, and therefore, it is possible to implement it following a fully pipelined scheme.
6. DPU for the FSE
The intricate node ordering and selection procedure required by the K-best algorithm is replaced by a simple Schnorr-Euchner enumerator in the FSE tree-search model. This results in a considerably simpler DPU architecture for the FSE scheme.
Figure 5 depicts the structure of the FSE DPU, where the block diagram for and is represented. First of all, the data of the values transferred from level are used to compute the intermediate values for each one of the parent nodes. Afterwards, the node selection procedure is performed by means of a simple rounding operation when , as depicted in the illustrative example in Figure 5, or by means of the unordered puzzle enumerator [32] for the cases where . The PEDs of the selected nodes are then computed by MCUs and accumulated to the AEDs from the previous level. Finally, as was the case with the K-best DPU, the FSE DPU does not have any feedback paths in its design, and hence, it can be easily implemented following a fully pipelined scheme.
7. Design Considerations
This section addresses the design parameter selection for the fixed-complexity algorithms to be implemented in hardware. Additionally, the impact of applying an approximate norm for the computation of the distance increments is studied from an error-rate performance point of view.
7.1. Choice of the Design Parameters
The configurable parameters and offer a flexible trade-off between performance and complexity for the K-best and FSE encoders, respectively. These configuration parameters establish the shape of the search tree, which in turn determines the number of hardware multipliers required for its implementation. Embedded multipliers are scarce in FPGA devices and are considered an expensive resource in application-specific integrated circuit (ASIC) designs. Thus, the number of multiplication units required by the tree-search algorithm has been regarded as the critical factor in the current hardware architecture design. For the sake of a fair comparison, the configuration parameters of the fixed-complexity tree-search methods have been selected so as to yield a similar number of allocated embedded multipliers.
Considering that multipliers are used for the multiplication of two complex terms, the number of multiplication units required for the K-best tree-search structure can be computed as whereas the total number of embedded multipliers for an FSE tree structure is given by
The number of required embedded multipliers for the K-best and FSE tree-search techniques is shown in Figure 6 for a system with single-antenna users. The number of multiplier units is given as a function of the number of candidate branches, namely, and for the K-best and FSE approaches, respectively. As one can notice, the amount of hardware resources in the K-best tree-search model grows linearly with the number of considered candidate branches. However, this constant growth rate does not apply to the FSE case. This is due to the values being differently distributed through the tree configuration vector depending on the divisibility of . As proved in [37], among all possible tree configuration vectors that yield the same value of , the one with the most dispersedly distributed values of achieves the best error-rate performance and requires the lowest number of allocated embedded multipliers.
In order to assess the hardware resource occupation required for the implementation of the different tree-search algorithms, the design parameter values and have been selected for the K-best and FSE models, respectively. This choice of parameters ensures a similar multiplier occupation for both schemes and offers a significantly better error-rate performance than other lower and pairs, for example, and .
The BER versus SNR curves of the implemented fixed-complexity schemes are depicted in Figure 7. As one can notice, the error-rate performance of the implemented models is close to the optimum set by the SE in the low-to-mid SNR range. However, a performance degradation of 0.5 dB is noticeable for the FSE model at high SNRs, whereas the performance gap of the K-best structure increases with the SNR, reaching up to 3 dB at a BER of .
7.2. Implementation of an Approximate Norm
A significant portion of the hardware resources in the implementation of any tree-search algorithm is dedicated to computing the norms required by the cost function in (3). Additionally, the long delays associated with squaring operations required to compute the PEDs account for a significant portion of the latency of the fixed-complexity tree-search architectures. It is possible to overcome these problems by using an approximate norm that prevents the use of the computationally expensive squaring operations.
The application of the modified-norm algorithm (MNA) [38] entails two main benefits: on one hand, a simplified distance computation scheme that immediately reduces silicon area and delay of the arithmetic units can be performed, and on the other hand, a smaller dynamic range of the PEDs is achieved. The key point of the MNA is to compute the square root of the accumulated and partial distance increments, namely, and , respectively. Hence, the accumulation of the distance increments in this equivalent model gives . An approximate norm can now be applied to get rid of the computationally expensive squaring and square root operations, such that . This way, the accumulated distance computation in (4) can be reformulated as with for the norm variant of the algorithm. The norm approximation can also be performed following the norm simplified model, in which case the following expressions should be considered with
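The effect of replacing the exact accumulation by an approximate norm can be explored with a small Python sketch. The names of the two norm variants did not survive in the text above, so the sketch assumes the sum-of-magnitudes (l1) and maximum-magnitude (l-infinity) approximations that are commonly paired with the modified-norm formulation; the function name `accumulate` and the sample values are likewise hypothetical.

```python
import math


def accumulate(increments, mode):
    """Accumulate square-rooted distances level by level.

    `increments` holds one complex value per tree level whose modulus is the
    square root of that level's distance increment (root level first).
    """
    d = 0.0
    for e in increments:
        re, im = abs(e.real), abs(e.imag)
        if mode == "l2":        # exact: needs squarers and a square root
            d = math.sqrt(d * d + re * re + im * im)
        elif mode == "l1":      # assumed variant: additions only
            d = d + re + im
        elif mode == "linf":    # assumed variant: comparisons only
            d = max(d, re, im)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return d


increments = [0.6 - 0.8j, 0.3 + 0.4j, -1.2 + 0.5j]
exact = accumulate(increments, "l2")
upper = accumulate(increments, "l1")
lower = accumulate(increments, "linf")
print(lower, exact, upper)
```

Under these assumptions the sum-based value never underestimates the exact distance and the maximum-based value never overestimates it, so the two approximations bracket the exact metric while avoiding multiplications altogether.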
The implementation of an approximate norm impacts the error-rate performance of the VP system differently depending on the tree-search strategy used in the perturbation process. This fact is shown in Figure 8, where the BER performance degradation introduced when approximating the norm by the suboptimum and norms is depicted for the FSE and K-best tree-search approaches. For the FSE case depicted in Figure 8(a), the use of an approximate norm only affects the accumulated distances related to the candidate branches, but not the branches themselves. This is due to the fact that the nodes expanded at each level where are the same regardless of the norm used to compute the distance increments to . In the K-best model, on the other hand, the node selection procedure is solely based on previously computed distances, and therefore, the introduction of an approximate norm will noticeably alter the structure of the candidate branches. Consequently, a higher error-rate performance degradation of the K-best algorithm with an approximate norm can be expected when compared to the norm-simplified FSE model.
Figure 8: BER performance degradation introduced by the approximate norms. (a) FSE. (b) K-best.
The implementation of the approximate norm yields a highSNR performance loss of 0.22 dB and 0.25 dB for the FSE and Kbest fixedcomplexity algorithms, respectively. Due to the worse approximation of the Euclidean distances performed by the suboptimum norm, the performance gap with respect to the optimum FSE and Kbest structures is widened in this case. This way, a performance loss of 0.45 dBs is experienced by the simplified FSE model, whereas an errorrate degradation of 0.85 dBs is suffered by the Kbest in the highSNR regime. In any case, the implementation of an alternative norm does not alter the diversity order of the VP scheme.
The computational complexity reduction yielded by both norm approximations is similar, whereas the error-rate performance is slightly better for the norm with the smaller losses reported above. Consequently, that norm-simplified model will be considered for hardware implementation.
8. Implementation Results
The proposed tree-search architectures have been implemented on a Xilinx Virtex VI FPGA (XC6VHX250T-3). The occupation results have been obtained by means of the place-and-route tool included in the System Generator for DSP software.
Table 1 depicts the device occupation summary of the implemented vector precoders for the number of users and the set of eligible lattice points considered. Even though the FSE and K-best models use a similar number of embedded multipliers (DSP48E1), the device occupation in terms of slices is considerably higher for the latter. This is due to the longer latency of the K-best architecture caused by the distributed sorting procedure, which ultimately results in a large amount of data being stored in several pipeline stages. As a consequence, a considerably larger share of the slice LUTs is used as memory in the K-best implementation than in the FSE. Other than the higher occupation due to pipeline registers, the difference in latency between the two designs is of minor importance, as both structures are fully pipelined and therefore output a processed data vector at every clock cycle. As already anticipated, the use of the approximate norm yields a notable reduction in the number of allocated embedded multipliers for both fixed-complexity tree-search models.

The maximum throughput of the implemented architectures, in processed gigabits per second, is also shown in Table 1 for a 16-QAM modulation constellation. For a fully pipelined architecture, which processes one data vector per clock cycle, the throughput in (12) is the product of the number of users, the number of bits per constellation symbol, and the maximum working frequency of the design as given by the Post-Place and Route Static Timing Report. Both tree-search algorithms achieve a very high data-processing throughput (in the range of 5 Gbps) due to the loopless parallel structure that enables the processing of a new data vector at every clock cycle. Note that the maximum working frequency of the designs presented in this contribution can be obtained by using (12) and the results in Table 1.
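The throughput relation for a fully pipelined design, and its inversion to recover the clock frequency from a reported throughput, can be sketched as follows. The function names are hypothetical, and the user count and clock frequency in the example are assumptions chosen only to land in the 5 Gbps range mentioned above; they are not values taken from Table 1.

```python
import math

def throughput_bps(n_users, constellation_size, f_clk_hz):
    """One data vector per clock cycle: users * bits per symbol * clock rate."""
    bits_per_vector = n_users * math.log2(constellation_size)
    return bits_per_vector * f_clk_hz

def clock_from_throughput(throughput, n_users, constellation_size):
    """Invert the relation to recover the working frequency in Hz."""
    return throughput / (n_users * math.log2(constellation_size))

# Hypothetical operating point: 4 users, 16-QAM, 312.5 MHz clock.
q = throughput_bps(4, 16, 312.5e6)
f = clock_from_throughput(q, 4, 16)
```

Under these assumed values each clock cycle delivers 16 bits, so the sketch gives 5 Gbps. Latency does not appear in the relation, which matches the observation that the longer K-best pipeline costs registers but not throughput.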
Note also that a higher throughput can be achieved by increasing the order of the modulation in use. In such a case, the modifications required in the proposed architecture are minimal: an update of the lattice values considered and the adaptation of the first DPU, where the straightforward node sequencing is performed. Furthermore, given the low hardware resource occupation of the proposed FSE tree-search architectures, a higher data-processing throughput can easily be obtained by running several tree-search instances in parallel.
Table 2 compares the area and throughput of the proposed K-best and FSE hardware architectures with similar structures used in point-to-point MIMO detection. Even though a direct comparison should be made carefully due to the already described differences between multiuser precoding and MIMO detection scenarios, it is worth noting the high Mbps/kGE ratio of the presented FSE approach.

9. Conclusion
This paper has addressed the issues of a fully pipelined implementation of the FSE and K-best tree-search approaches for a VP system. The sorting stages required by the K-best scheme have been performed by means of the WPE distributed sorting strategy along with a non-sequential complex-plane enumerator, which has also been incorporated into the FSE structure to determine the child nodes to be expanded at the tree levels that require it. The design parameters that establish the performance-complexity tradeoff of these non-recursive tree-search approaches have been set so as to yield a similar count of allocated embedded multipliers. Additionally, the use of an approximate norm to reduce the computational complexity of the PED calculations has been considered.
The performance results provided have shown close-to-optimal performance and a very high achievable throughput, in the range of 5 Gbps, for both techniques. Nevertheless, the FSE has been shown to considerably outperform the K-best in error-rate performance in the high-SNR range. Additionally, the FPGA resource occupation results have demonstrated the greater efficiency of the FSE architecture when compared to the K-best fixed-complexity structure.
Owing to its good performance, occupation results, and simplicity of implementation, it is concluded that the FSE is best suited for the practical implementation of fixed-complexity, high-throughput vector precoders.
Acknowledgments
The authors would like to thank the Department of Education, Universities and Research and the Department of Industry, Trade and Tourism of the Basque Government.
References
[1] B. M. Hochwald, C. B. Peel, and A. L. Swindlehurst, “A vector-perturbation technique for near-capacity multiantenna multiuser communication, part II: perturbation,” IEEE Transactions on Communications, vol. 53, no. 3, pp. 537–544, 2005.
[2] C. Windpassinger, R. F. H. Fischer, and J. B. Huber, “Lattice-reduction-aided broadcast precoding,” IEEE Transactions on Communications, vol. 52, no. 12, pp. 2057–2060, 2004.
[3] A. K. Lenstra, H. W. Lenstra, and L. Lovász, “Factoring polynomials with rational coefficients,” Mathematische Annalen, vol. 261, no. 4, pp. 515–534, 1982.
[4] L. Babai, “On Lovász' lattice reduction and the nearest lattice point problem,” Combinatorica, vol. 6, no. 1, pp. 1–13, 1986.
[5] D. Seethaler and G. Matz, “Efficient vector perturbation in multiantenna multiuser systems based on approximate integer relations,” in Proceedings of the EURASIP European Signal Processing Conference (EUSIPCO '06), pp. 1–5, September 2006.
[6] S. Hur, N. Kim, H. Park, and J. Kang, “Enhanced lattice-reduction-based precoder with list quantizer in broadcast channel,” in Proceedings of the IEEE 66th Vehicular Technology Conference (VTC '07), pp. 611–615, October 2007.
[7] F. Liu, L. Jiang, and C. He, “Low complexity MMSE vector precoding using lattice reduction for MIMO systems,” in Proceedings of the IEEE International Conference on Communications (ICC '07), pp. 2598–2603, June 2007.
[8] M. Taherzadeh, A. Mobasher, and A. K. Khandani, “LLL lattice-basis reduction achieves the maximum diversity in MIMO systems,” in Proceedings of the IEEE International Symposium on Information Theory (ISIT '05), pp. 1300–1304, September 2005.
[9] M. Taherzadeh, A. Mobasher, and A. K. Khandani, “Communication over MIMO broadcast channels using lattice-basis reduction,” IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4567–4582, 2007.
[10] K. H. Lin, H. L. Lin, R. C. Chang, and C. F. Wu, “Hardware architecture of improved Tomlinson-Harashima Precoding for downlink MC-CDMA,” in Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems (APCCAS '06), pp. 1200–1203, December 2006.
[11] A. Burg, D. Seethaler, and G. Matz, “VLSI implementation of a lattice-reduction algorithm for multiantenna broadcast precoding,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '07), pp. 673–676, May 2007.
[12] P. Bhagawat, W. Wang, M. Uppal et al., “An FPGA implementation of dirty paper precoder,” in Proceedings of the IEEE International Conference on Communications (ICC '07), pp. 2761–2766, June 2008.
[13] L. G. Barbero and J. S. Thompson, “Rapid prototyping of a fixed-throughput sphere decoder for MIMO systems,” in Proceedings of the IEEE International Conference on Communications (ICC '06), vol. 7, pp. 3082–3087, June 2006.
[14] L. G. Barbero and J. S. Thompson, “FPGA design considerations in the implementation of a fixed-throughput sphere decoder for MIMO systems,” in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '00), pp. 1–6, August 2006.
[15] L. G. Barbero and J. S. Thompson, “Extending a fixed-complexity sphere decoder to obtain likelihood information for turbo-MIMO systems,” IEEE Transactions on Vehicular Technology, vol. 57, no. 5, pp. 2804–2814, 2008.
[16] K. W. Wong, C. Y. Tsui, R. S. K. Cheng, and W. H. Mow, “A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '02), vol. 3, pp. 273–276, May 2002.
[17] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, “K-best MIMO detection VLSI architectures achieving up to 424 Mbps,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '06), pp. 1151–1154, September 2006.
[18] Q. Li and Z. Wang, “Improved K-best sphere decoding algorithms for MIMO systems,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '06), pp. 1159–1162, May 2006.
[19] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for MIMO detection,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 3, pp. 491–503, 2006.
[20] M. Shabany and P. G. Gulak, “Scalable VLSI architecture for K-best lattice decoders,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '08), pp. 940–943, May 2008.
[21] C. A. Shen and A. M. Eltawil, “A radius adaptive K-best decoder with early termination: algorithm and VLSI architecture,” IEEE Transactions on Circuits and Systems I, vol. 57, no. 9, pp. 2476–2486, 2010.
[22] S. Chen, T. Zhang, and Y. Xin, “Relaxed K-best MIMO signal detector design and VLSI implementation,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 3, pp. 328–337, 2007.
[23] M. Mahdavi, M. Shabany, and B. V. Vahdat, “A modified complex K-best scheme for high-speed hard-output MIMO detectors,” in Proceedings of the 53rd IEEE International Midwest Symposium on Circuits and Systems (MWSCAS '10), pp. 845–848, August 2010.
[24] C. A. Shen, A. M. Eltawil, and K. N. Salama, “Evaluation framework for K-best sphere decoders,” Journal of Circuits, Systems and Computers, vol. 19, no. 5, pp. 975–995, 2010.
[25] D. A. Schmidt, M. Joham, and W. Utschick, “Minimum mean square error vector precoding,” in Proceedings of the IEEE 16th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC '05), vol. 1, pp. 107–111, September 2005.
[26] E. Viterbo and J. Boutros, “A universal lattice code decoder for fading channels,” IEEE Transactions on Information Theory, vol. 45, no. 5, pp. 1639–1642, 1999.
[27] M. O. Damen, H. El Gamal, and G. Caire, “On maximum-likelihood detection and the search for the closest lattice point,” IEEE Transactions on Information Theory, vol. 49, no. 10, pp. 2389–2402, 2003.
[28] C. P. Schnorr and M. Euchner, “Lattice basis reduction: improved practical algorithms and solving subset sum problems,” in Proceedings of the International Symposium on Fundamentals of Computation Theory (FCT '91), vol. 529, pp. 68–85, September 1991.
[29] J. Zhang and K. J. Kim, “Near-capacity MIMO multiuser precoding with QRD-M algorithm,” in Proceedings of the 39th Asilomar Conference on Signals, Systems and Computers (ACSSC '05), vol. 1, pp. 1498–1502, November 2005.
[30] R. Habendorf and G. Fettweis, “Vector precoding with bounded complexity,” in Proceedings of the 8th IEEE Signal Processing Advances in Wireless Communications (SPAWC '07), pp. 1–5, June 2007.
[31] M. Barrenechea, M. Mendicute, J. Del Ser, and J. S. Thompson, “Wiener filter-based fixed-complexity vector precoding for the MIMO downlink channel,” in Proceedings of the IEEE 10th Workshop on Signal Processing Advances in Wireless Communications (SPAWC '09), pp. 216–220, June 2009.
[32] M. Barrenechea, M. Mendicute, I. Jimenez, and E. Arruti, “Implementation of complex enumeration for multiuser MIMO vector precoding,” in Proceedings of the EURASIP European Signal Processing Conference (EUSIPCO '11), pp. 739–743, August 2011.
[33] P. Y. Tsai, W. T. Chen, X. C. Lin, and M. Y. Huang, “A 4 × 4 64-QAM reduced-complexity K-best MIMO detector up to 1.5 Gbps,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '10), pp. 3953–3956, May 2010.
[34] S. Mondal, W. H. Ali, and K. N. Salama, “A novel approach for K-best MIMO detection and its VLSI implementation,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '08), pp. 936–939, May 2008.
[35] S. Mondal, A. M. Eltawil, and K. N. Salama, “Architectural optimizations for low-power K-best MIMO decoders,” IEEE Transactions on Vehicular Technology, vol. 58, no. 7, pp. 3145–3153, 2009.
[36] S. Mondal, A. Eltawil, C. A. Shen, and K. N. Salama, “Design and implementation of a sort-free K-best sphere decoder,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 10, pp. 1497–1501, 2010.
[37] M. Barrenechea, Design and implementation of multiuser MIMO precoding algorithms [Ph.D. dissertation], University of Mondragon, Mondragon, Spain, 2012.
[38] A. Burg, M. Wenk, M. Zellweger, M. Wegmueller, N. Felber, and W. Fichtner, “VLSI implementation of the sphere decoding algorithm,” in Proceedings of the 30th European Solid-State Circuits Conference (ESSCIRC '04), pp. 303–306, September 2004.
Copyright
Copyright © 2013 Maitane Barrenechea et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.