Table of Contents
ISRN Signal Processing
Volume 2011, Article ID 615934, 8 pages
Research Article

Low-Complexity Inverse Square Root Approximation for Baseband Matrix Operations

1Department of Computer Systems, Tampere University of Technology, P.O. Box 553, 33101 Tampere, Finland
2Nokia Multimedia Imaging, Nokia Devices R&D, Nokia, Visiokatu 1, 33720 Tampere, Finland
33GP/DSE, ST-Ericsson, Hermiankatu 1 B, 33720 Tampere, Finland
4Nokia Devices R&D, Nokia, Elektroniikkatie 3, 90570 Oulu, Finland

Received 8 December 2010; Accepted 11 January 2011

Academic Editors: E. Salerno and E. D. Übeyli

Copyright © 2011 Perttu Salmela et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Baseband functions like channel estimation and symbol detection of sophisticated telecommunications systems require matrix operations, which apply highly nonlinear operations like division or square root. In this paper, a scalable low-complexity approximation method of the inverse square root is developed and applied in Cholesky and QR decompositions. Computation is derived by exploiting the binary representation of the fixedpoint numbers and by substituting the highly nonlinear inverse square root operation with a more implementation appropriate function. Low complexity is obtained since the proposed method does not use large multipliers or look-up tables (LUT). Due to the scalability, the approximation accuracy can be adjusted according to the targeted application. The method is applied also as an accelerating unit of an application-specific instruction-set processor (ASIP) and as a software routine of a conventional DSP. As a result, the method can accelerate any fixed-point system where cost-efficiency and low power consumption are of high importance, and coarse approximation of inverse square root operation is required.

1. Introduction

Ever higher data rates require sophisticated transmission techniques but, on the other hand, the latest technologies allow use of advanced and more complex receiver algorithms. Such algorithms apply matrix operations which require highly nonlinear division by square root operation. For example, linear minimum mean square error (LMMSE) estimation has been proposed for the receivers of the current 3G, Universal Mobile Telecommunications System [1], and Cholesky decomposition can be used for the inevitable matrix inversion or for solving linear systems. In the upcoming 3G Long-Term Evolution (LTE) systems, multiple-input multiple-output (MIMO) receivers require demanding symbol detection methods like list sphere detector (LSD), which applies QR decomposition of a complex-valued channel matrix. When compared to matrix computation studies targeted for processing large matrices with highly parallel resources [2, 3], there are four notable differences in the baseband processing in the telecommunications field: (i)the matrices are relatively small, (ii)fixed-point number system is preferred, (iii)there are real-time limits, (iv)low complexity and low power consumption are of high importance.

In this paper, a low-complexity inverse square root approximation method is proposed for baseband matrix operations. The method relies on binary presentation of the fixed-point number system and it avoids large LUTs, large multipliers, and floating-point arithmetic units. The principal idea of the method is to substitute the highly nonlinear inverse square root function with a less nonlinear function with appropriate pre- and postprocessing. The accuracy and complexity of the method can be adjusted with one design parameter. Thus, the method lends itself to lower-complexity applications where coarse approximations and fixed-point computations are preferred. In addition to comparison of hardware implementations of inverse square root methods, we show how the proposed method can be applied as a software routine or as an accelerating unit of an ASIP. The implementations are applied for Cholesky and QR decompositions required by 3G and 3G LTE receivers.

2. Previous Work

There are several methods to compute the inverse square root function. One of the basic approaches is to use lookup tables (LUT) for obtaining an initial value for iterations, which refine the value to higher accuracy [4, 5]. The main differences among these kinds of methods are in the size and content of LUT and the used iteration algorithm. In [5], a large multiplier was used since it was available in the targeted general purpose processor. In [6], savings were obtained by using a 𝑚×𝑛 multiplier, 𝑚𝑛, and utilizing the fact that less significant bits of intermediate result do not contribute to the accuracy of the final result. A software implementation using LUT initialization followed by iterations was presented in [7]. Another software approximation in [8] relied heavily on the binary representation of floating-point numbers.

LUTs using low-order polynomial approximation were applied in [9]. Polynomial approximation was also used in [10] where a second-degree minimax polynomial approximation was followed by modified Goldschmidt iteration. A comparison considering area costs was also given. Digit recurrence methods were proposed in [11, 12]. The main disadvantage of using digit recurrence when compared to iterative algorithms is their linear convergence. Approximation based on LUT followed by multiplication with operand modification was proposed in [13, 14] and used also in [15]. Argument reduction followed by series expansion was applied in [16]. Another approach is to work in logarithmic domain [17, 18] where the computation of the inverse square root is straightforward [19, 20].

For shorter word lengths (WLs) and for using fixed-point numbers, table addition methods have been proposed. These methods consist of parallel LUTs and multioperand additions. As a benefit, no multipliers are required. In [21], a symmetric table addition method (STAM) was developed as an extension to a simpler bipartite method. Selecting appropriate multipartite method, that is, design space exploration, was considered in [22]. The STAM enhanced with an error correction term and internal presentation in exponent and mantissa form was used in [23].

When compared to the previously mentioned methods, the proposed method in this paper is novel, that is, it is not a derivative of any of the existing methods. The area costs are kept at low as large LUTs and large multipliers are avoided. The proposed method lends itself also to software implementation. Furthermore, the proposed method can be adjusted to work only in subunitary range, which is sufficient for, for example, Cholesky decomposition, and the accuracy of the method can be adjusted along with the complexity up to a certain level while maintaining high area efficiency.

3. Targeted Matrix Decompositions in Baseband Processing

In this section, we describe where the targeted low-complexity inverse square root operations have been applied.

3.1. Baseband Processing with Fixed-Point Number System

As the baseband functions are applied in receivers of, for example, handheld telecommunications devices, low complexity is important for decreasing the area costs and power consumption. Therefore, fixed-point number system is preferred, that is, limited accuracy is applied. In this study, the targeted fractional word length (FWL) is 11 bits and integer word length (IWL) is 5 bits, that is, 16-bit words are assumed.

Targeted matrix operations are illustrative examples of baseband functions for two reasons. Firstly, the computations consist mainly of massive repetitions of a single operation, which is multiply and accumulate in this case. Secondly, an efficient mapping of computations to custom hardware or DSP is prevented by one less frequently used, but demanding, operation, which is inverse square root, 1/𝑥, in this case. Thus, there is a realistic need for low-complexity, limited accuracy implementations of 1/𝑥 function.

3.2. Cholesky Decomposition for LMMSE

The LMMSE estimation of transmitted data vector applies typically the Cholesky decomposition. Basically, the LMMSE estimate, ̂𝐝, can be expressed as a function of received data, 𝐲, channel matrix, 𝐌, and power of noise, 𝜎2:̂𝐝=𝐌𝐻𝐌+𝜎2𝐈1𝐌𝐻𝐲,(1) where it is assumed that autocorrelation 𝐸(𝐝𝐝𝐻)=𝐈. The estimation in (1) can be derived into a form, which can be presented as a linear system with positive definite real-valued matrix. With Cholesky decomposition 𝐀=𝐋𝑇𝐋, such a linear system, 𝐀𝐱=𝐛, can be solved with the aid of two triangular systems, that is, 𝐋𝑇𝐳=𝐛 and 𝐋𝐱=𝐳.

Diagonal elements, 𝑙𝑖𝑖, of Cholesky factor 𝐋 are defined as𝑙𝑖𝑗=𝑎𝑖𝑖𝑖1𝑘=1𝑙2𝑖𝑘(2) and nondiagonal elements, 𝑙𝑖𝑗, as𝑙𝑖𝑗=1𝑙𝑖𝑗𝑎𝑖𝑗𝑗1𝑘=1𝑙𝑖𝑘𝑙𝑗𝑘,(3) where 𝑎𝑖𝑗 denotes the elements of 𝐀. Equations (2) and (3) show that nondiagonal elements require division by square root and diagonals square root operations. Thus, the division by square root can be replaced with multiplication with the inverse square root, that is, two demanding operations are substituted with one demanding and one less demanding operation. The square root operation of (2) can also be computed with an additional multiplication, as 𝑥=𝑥/𝑥. An important property of Cholesky decomposition is the preservation of the subunitary of matrix elements, which limits the arguments of 1/𝑥 operations efficiently.

3.3. QR Decomposition for LSD

The LSD is used in MIMO receivers to estimate the transmitted symbol, 𝐬, by approximating maximum likelihood detection: 𝐬=argmin𝐬𝐲𝐇𝐬2,(4) where 𝐲 is the received symbol vector and 𝐇 is complex-valued channel matrix whose dimensions are equal to the number of transmit and receive antennas of MIMO system. The approximation is based on substitution with QR decomposition 𝐐𝐑=𝐇, that is,𝐬=argmin𝐬𝐲𝐑𝐬2,where𝐲=𝐐𝐻𝐲.(5) The LSD approximates (5) by gradually increasing the dimensions of symbol vector and computing partial Euclidean distances. With this practice, the search space can be limited efficiently.

QR factorization with modified Gram-Schmidt algorithm [24] is presented in Algorithm 1.

Algorithm 1:

It decomposes 𝐇𝑛×𝑛 to the orthogonal 𝐐𝑛×𝑛 and upper triangular 𝐑𝑛×𝑛. Conjugated transpose is denoted with ()𝐻. The lines 2 and 3 show that division by square root is required as the elements are divided by diagonals which are norms, . In a similar way as with Cholesky decomposition, the division can be substituted with multiplication by inverse square root.

4. Low-Complexity Approximation Method

The main principle of the proposed method is to avoid straightforward approximation of 1/𝑥 function which is highly nonlinear in subunitary range 0<𝑥1. Instead, the more softly nonlinear function 1/𝑐+𝑢 with 𝑐1 and 0<𝑢1 is approximated. The usage of 1/𝑐+𝑢 is justified by the following fixed-point representations in two complement formats of 𝑥, 𝑐, and 𝑢. If the positive subunitary 𝑥 has 𝛼 leading zeros, 𝑐 and 𝑢 can be defined so that𝑥=0.000𝛼𝑐𝑁1𝑐𝑁2𝑐0𝑢𝑀1𝑢𝑀2𝑢0.(6) In other words, the bits of 𝑐 and 𝑢 do not overlap and the word lengths of 𝑐 and 𝑢 are denoted with 𝑁 and 𝑀, respectively. Positive nonsubunitary range, 𝑥>1, is presented similarly, except that the number of leading zeros, 𝛼, can have negative values. Since 𝑐𝑁1=1 for all valid values of 𝑥, the 𝑥 can be presented with the aid of shifting by 𝛼, that is,𝑥×2𝛼=1𝑐𝑁2𝑐0𝑢𝑀1𝑢𝑀2𝑢0,𝑥=2𝛼×1𝑐𝑁2𝑐0𝑢𝑀1𝑢𝑀2𝑢0.(7) Thus, the desired form, 𝑐+𝑢, can be obtained, and we note that 𝑢 is a positive subunitary number. The targeted function can be written as1𝑥=12𝛼(𝑐+𝑢)=2𝛼/21𝑐+𝑢.(8)

We can distinguish two cases depending on the value of 𝛼, which represents the number of leading zeros of fixed-point binary representation of 𝑥 (6). This distinct behavior is obtained because the remainder of 𝛼/2 in (8) can be either zero or one. For even values 𝛼=2𝑘,1𝑥=2𝑘1𝑐+𝑢(9) and, for odd values 𝛼=2𝑘+1,1𝑥=2𝑘21𝑐+𝑢.(10)

In order to approximate (9) or (10), the expression 1/𝑐+𝑢 must be considered. A tempting solution is to approximate 1/𝑐+𝑢 with binomial series. In principle, the 1/𝑐+𝑢 could be approached with arbitrarily high precision, as the binomial series converges. Multipliers are required if polynomial approximation [9, 10] or series expansion [16] is applied. Although the approximation with binomial series has a solid basis, it does not lend itself to low-complexity implementations due to the high-order terms.

4.1. Linear Approximation

We attempt to identify the characteristic of 1/𝑐+𝑢 and to determine a first-degree polynomial that will give the smallest approximation error for a low-complexity hardware implementation. So, we will approximate the expression 1/𝑐+𝑢 with straight lines, that is,1𝑐+𝑢𝑎𝑡𝑏𝑡(𝑐+𝑢),(11) where 𝑎𝑡,𝑏𝑡>0 and subscript 𝑡 is the integer interpretation of the concatenation 𝑐𝑁2𝑐0𝛼0. The number of approximating lines, that is, the accuracy of the approximation, depends on the WL of 𝑐. Since the MSB of 𝑐 has always constant value, 𝑐𝑁1=1, the number of approximating lines is 2𝑁.

The range of the targeted expression is 11/𝑐+𝑢1/2, since 1𝑐+𝑢<2. The domain of 𝑐 is defined by the WL, 𝑁, that is, 1𝑐2(12𝑁). Naturally, the domain of 𝑢 depends on 𝑁 and 𝑀, that is, 0𝑢2(𝑁1)2(𝑀+𝑁1). In practice, the approximating lines are formed by dividing the domain of 𝑐+𝑢 into evenly spaced regions, which are determined by the 𝑁 highest bits of 𝑐. The values in the start and end points are given by 1/𝑐 and the value of the last end point is 1/2. The linear approximation is illustrated in Figure 1(a) where 1/𝑐+𝑢, with even 𝛼 approximated with 𝑁=1,2,3. The error of approximation is shown in Figure 1(b). The figures indicate that by increasing the word length 𝑁, the accuracy of the approximation can be adjusted conveniently.

Figure 1: Linear approximation of 1/𝑐+𝑢: (a) approximating lines; (b) approximation error decreases as 𝑁 is increased.

For odd values 𝛼=2𝑘+1, the 2/𝑐+𝑢 is approximated in a similar way. The lines used for even values, 𝛼=2𝑘, cannot be used without multiplication with 2, and, therefore, different approximating lines are preferred. To obtain the final result, that is, the approximation of 1/𝑥, the approximating straight lines in must be scaled with 2𝑘 as shown in (9) and (10). The scaling can be carried out easily with shift operation, whose direction depends on the sign of 𝛼.

4.2. Coefficients for Hardware Implementation

The linear approximation has the form 𝑎𝑡𝑏𝑡(𝑐+𝑢), which includes multiplication. However, for obtaining low complexity, the multiplications should be avoided. Braun multiplier adds shifted values of the multiplicand multiplied with one bit of the multiplier. The principle of adding shifted values can be used to approximate the product 𝑏𝑡(𝑐+𝑢). Since 𝑏𝑡1/2, the product can be presented as𝑏𝑡(𝑐+𝑢)=𝑑1,𝑡𝑐+𝑢21+𝑑2,𝑡𝑐+𝑢22++𝑑𝑀+𝑁1,𝑡𝑐+𝑢2𝑀+𝑁1,(12) where 𝑑𝑖,𝑡{1,0,1}. As division with powers of two can be implemented with hardwired shifting in hardware, an approximation of the previous form is suitable for low-complexity implementation. Naturally, the accuracy depends on the number of terms included in the sum. In the proposed method, at maximum three terms are included, that is, an approximation,𝑏𝑡(𝑐+𝑢)𝑑1,𝑡𝑐+𝑢2𝑒1,𝑡+𝑑2,𝑡𝑐+𝑢2𝑒2,𝑡+𝑑3,𝑡𝑐+𝑢2𝑒3,𝑡,(13) in which 𝑑𝑖,𝑡{1,0,1} and 𝑒𝑖,𝑡{1,,8}, is used. The coefficients 𝑑𝑖,𝑡 and 𝑒𝑖,𝑡 are searched for each approximating line, that is, for each 𝑐 and 𝛼0, separately. Instead of three shifters with freely variable shift count, three multiplexers can be used to select appropriate terms.

5. Inverse Square Root Unit Implementations

The block diagrams of the hardware implementations of the inverse square root units are shown in Figure 2. Figure 2(a) shows only the linear approximation 𝑎𝑡𝑏𝑡(𝑐+𝑢). The top three multiplexers correspond with term 𝑏𝑡(𝑐+𝑢), and the fourth multiplexer outputs 𝑎𝑡. The selections of multiplexer are controlled by parity of 𝛼 and bits of 𝑐 excluding the 𝑐𝑁1 which has constant value.

Figure 2: Units for (a) linear approximation with 𝑎𝑡𝑏𝑡(𝑐+𝑢), (b) approximation of 1/𝑥 in subunitary range, and (c) approximation in nonsubunitary range. Left shifting and right shifting are denoted with and , respectively. Negation is marked with (1).

In the next block diagram in Figure 2(b) the previous unit is instantiated in the inverse square root unit. The range of the unit in Figure 2(b) is positive subunitary, that is, 0<𝑥1, which is sufficient for the Cholesky decomposition. The structure is further extended in Figure 2(c) to allow free range, that is, 𝑥>0. Basically, nonsubunitary range of 𝑥 results also in negative values of 𝛼, and, therefore, both left shifting and right shifting are required as indicated in Figure 2(c). In practice, shifters consists of hardwired shift operations from which one is selected with multiplexer. Therefore, a combination of left and right shifters can be assumed to have the same complexity of unidirectional shifter with respectively wider range of shifted bits. As the input signal 𝑥 has wider WL in Figure 2(c), the negative 𝛼 is detected by comparing the number of leading zeros and IWL.

Only basic arithmetic and logic units are being used. The key components are priority encoder, adders, multiplexers, and shifters. Part of the functionality, for example, constant scaling, is implemented by hardwiring bits to the new positions. Due to the scaling, WLs of intermediate signals are relatively short. As the targeted accuracy depends on 𝑁, different implementations can be generated according to targeted application. Figure 2(a) shows a general case, that is, the number of inputs of multiplexers and 𝑁 are free variables. In Figures 2(b) and 2(c)  𝑁=1 and, therefore, multiplexers are controlled solely by 𝛼0. If 𝑁>1, the 𝑐 is obtained from the output of the first shifter(s) and the control signal is generated by concatenation of 𝑐𝑁2𝑐0 and 𝛼0. Only the structure of linear approximation depends on 𝑁, and the other components in Figures 2(b) and 2(c) remain unaltered if 𝑁 is increased.

6. Comparisons

Areas of the proposed method and competitive methods are estimated for a suggestive comparison. The proposed method is synthesized with 130 nm technology. The areas of other methods are estimated by considering their most expensive area components, such as multipliers and LUTs, unless more accurate details are clearly specified in the referred design. Only the mantissa of floating-point implementations is considered since its computation is similar in fixed-point number system.

6.1. Estimation of Area

Areas in terms of logic gate equivalents (GEs) of the synthesized arithmetic and logic operations with different WLs are given in Table 1. Since the basic unit of area is one NAND gate, fractions are possible. On the contrary to simple cost estimation of LUTs in [10], we have estimated the area of all LUTs individually. If structures of LUTs are not specified in detail, fair assumptions are made for the referred works. The synthesized LUTs are filled with random bits. The main reason for accurate modeling of LUT complexity is that the relative area of LUT depends both on the address line width and data WL. Estimated areas of all the LUTs are given in Table 1.

Table 1: Estimated area in terms of GEs of arithmetic and logic units and LUTs in compared implementations.
6.2. Compared Implementations

Since low area is emphasized in the targeted application domain of baseband processing, the methods are compared using the area efficiency as the ratio of accuracy versus area. The metric is defined as areaeciency=accuracyinbitsareainGEs.(14) For single precision (SP) methods the accuracy is 23 bits and for dual precision (DP) methods 52 bits. The area efficiency results for all the methods are shown in Table 2. The average accuracy of the proposed method in Table 2 is obtained in the subunitary range. Nonsubunitary range would increase the average accuracy even further. There are four versions of the proposed method with design parameter 𝑁=1,2,3,4. The results show that the proposed method has the lowest area, and even if the accuracy is adjusted with 𝑁, the area efficiency remains the highest except with 𝑁=1. Naturally, the accuracy is relatively modest, as we have preferred the lowest area instead of high accuracy.

Table 2: Suggestive comparison of inverse square root methods. The proposed method has the highest area efficiency with 𝑁=2,3,4.

The first method in Table 2 was targeted to DP general purpose processor [5]. It required LUTs of sizes 210×23 and 211×23 and multiplier for 76×76 operands. Since the implementation was targeted to the general purpose processor, the hardware resources were not dedicated only to the inverse square root function. In [6] two 16×56 and one 56×56 multipliers were required. The total memory size was 72192 bits divided into four tables. For smaller gate count, we have assumed uniform division to four tables of 18048 bits with WL 22 bits, which is the widest word fetched from the tables. High speed was emphasized in [10]. Therefore, we compare with the method with single multiply and accumulate unit [10], which had better area efficiency. The authors also reported the complexity of 5030 full adders and, therefore, their value is used instead of our own estimates. In [13], SP floating-point numbers were targeted. A 210×25 LUT was required and a 20×20 multiplier. In addition, a requirement of 15 inverters was reported. Symmetric table addition method (STAM) was used in [21]. The smallest total LUT size was obtained with four LUTs of sizes 29×19, 28×10, 28×8, and 28×6. In addition, a sum of all the data read from LUTs must be generated, which requires adders with operand sizes 19×10, 8×6, and 20×9. Also a requirement of 45 XOR gates was reported. Both SP and DP were targeted in [16] but the method for SP gave better area efficiency. The SP method required 27×36 LUT, four 4×4 multipliers, and one 22×23 multiplier. Fixed-point number systems were targeted in [23]. The method applied STAM enhanced with added correction value. Estimated complexity of 625 GE and LUT size of 3456 bits were given in [23]. Since the structures of LUTs were not reported, we have assumed that, due to the STAM, the memory is divided at least to two LUTs. We also assume 16-bit WL. Several smaller LUTs or shorter WL would degrade the area efficiency. The estimated complexity of LUTs is added to the reported gate count.

6.3. Power Consumption

The power consumption of the largest proposed unit (𝑁=4) with 100 MHz is 0.339 mW, which includes the power required by input and output registers. Naturally, the static power consumption is proportional to the area, and, therefore, low complexity has been targeted. The dynamic power is proportional both to the area and average switching activity of the gates. Even if the average switching activity of competitive methods cannot be estimated sufficiently accurately, the differences in the area are significant. For example, [23] has the smallest area, 1602 GE, of the referred methods and the average switching activity of [23] should get as low as 622/1602×100%=39% of the average switching activity of the largest proposed unit (622 GE, 𝑁=4) to achieve roughly the same dynamic power consumption.

7. Matrix Decomposition Implementation Case Studies

In this section, the proposed method for 1/𝑥 approximation is applied in QR and Cholesky decomposition implementations. Many matrix decompositions are implemented with systolic arrays [25] applying inherent regularity of the algorithm, and the computations are alleviated with CORDIC [26] algorithm. However, such structures are not as flexible as programmable processors and such high parallelism can easily result in so fast processing that the array processor must idle most of the time if applied, for example, in QR decomposition of MIMO receiver.

7.1. QR Decomposition with ASIP

Complex-valued QR decomposition was implemented with transport triggered architecture (TTA) [27] processor in [28]. The TTA is an ASIP template where parallel computing resources can be tailored according to the application. The proposed 1/𝑥 unit is instantiated in the processor as shown in Figure 3. In addition, there are units for complex addition and subtraction and complex multiplication with optional conjugation.

Figure 3: TTA processor for QR decomposition. Filled circles denote valid connections between resources and internal buses. The processor has special function units for complex-valued arithmetic and 1/𝑥 approximation.

Typically, MIMO systems have only a couple of transmit and receive antennas, and, therefore, a 4×4 matrix decomposition is targeted. The modified Gram-Schmidt algorithm requires 2𝑛3 operations for 𝑛×𝑛 matrix [24]. The processor implementation takes 139 clock cycles for 4×4 matrix. If 2048 subcarriers must be processed within a coherence time of the channel, 160 MHz clock frequency is adequate. The processor is synthesized with 130 nm technology and it takes 16.3 kGE with 160 MHz and 23.2 kGE with 269 MHz. The power consumption with 160 and 269 MHz clock frequencies is 6.91 mW and 16.79 mW, respectively.

7.2. Cholesky Decomposition with DSP

Cholesky decomposition was implemented as a software routine on TI's C55x DSP in [29]. Equations (2) and (3) show that the algorithm lends itself to the multiply and accumulate instruction, for which DSPs are typically optimized. Furthermore, an efficient hardware looping can be applied in the innermost loops as testing within the loop can be avoided. With the simple 1/𝑥 approximation the developed program decomposes 64×64 matrix in 85070 clock cycles.

Maintenance of a continuous flow of computations is more important for an efficient software implementation than focusing on avoidance of multiplications. In other words, pipeline should be kept full by avoiding conditional branching when possible. For example, describing computation of 𝛼 as defined in (6) in C language would result in a cumbersome loop testing bits of 𝑥. However, it can be avoided as the instruction set of the applied DSP has adequate assembly instruction for obtaining the number of leading zeros. Furthermore, short branches according to the parity of 𝛼 can be avoided with guarded instructions, that is, the computations proceed uninterrupted by both branches but only the other branch affects the state. Thus, the proposed method lends itself to an efficient software implementation on a DSP with adequate instructions for obtaining the number of leading zeros and for guarded execution.

8. Conclusions

The inverse square root function is highly nonlinear function and, therefore, approximated usually with high-complexity implementation. The proposed approximation method targets moderate precision fixed-point numbers. The computation has been based on an appropriate substitute, which allowed approximation without large LUTs and large multipliers. The method has one design parameter which allows scaling of the accuracy and hardware complexity. The area efficiency of the proposed method has been given in terms of approximation accuracy per area. Comparisons with previously reported methods show that the proposed method achieves low complexity and high area efficiency. Finally, the method has been applied on the targeted baseband functions as a function unit of an ASIP and as a software routine on a DSP.


  1. P. Darwood, P. Alexander, and I. Oppermann, “LMMSE chip equalization for 3GPP WCDMA downlink receivers with channel coding,” in Proceedings of the IEEE International Conference on Communications (ICC '01), vol. 5, pp. 1421–1425, Helsinki, Finland, June 2001.
  2. G. Baker, J. Gunnels, G. Morrow, B. Riviere, and R. van de Geijn, “PLAPACK: high performance through high-level abstraction,” in Proceedings of the International Conference on Parallel Processing (ICPP '98), pp. 414–422, Minneapolis, Minn, USA, August 1998.
  3. J. Demmel, “LAPACK: a portable linear algebra library for supercomputers,” in Proceedings of the IEEE Control Systems Society Workshop on Computer-Aided Control System Design (CACSD '89), pp. 1–7, Tampa, Fla, USA, December 1989.
  4. H. Kwan, R. L. Nelson, and E. E. Swartzlander, “Cascade implementation of an iterative inverse-square-root algorithm, with overflow lookahead,” in Proceedings of the IEEE 12th Symposium on Computer Arithmetic, pp. 115–122, Bath, UK, July 1995.
  5. S. F. Oberman, “Floating point division and square root algorithms and implementation in the AMD-K7 microsprocessor,” in Proceedings of the 14th IEEE Symposium on Computer Arithmetic, pp. 106–115, Adelaide, Australia, April 1999.
  6. W. F. Wong and E. Goto, “Fast hardware-based algorithms for elementary function computations using rectangular multipliers,” IEEE Transactions on Computers, vol. 43, no. 3, pp. 278–294, 1994. View at Publisher · View at Google Scholar · View at Scopus
  7. K. Turkowski, “Computing the inverse square root,” Tech. Rep. 95, Media Technologies: Computer Graphics Advanced Technology Group Apple Computer, Inc., October 1994. View at Google Scholar
  8. C. Lomont, “Fast inverse square root,” Tech. Rep., Department of Mathematics, Purdue University, West Lafayette, Ind, USA, February 2003. View at Google Scholar
  9. V. K. Jain, S. Shrivastava, A. D. Snider, D. Damerow, and D. Chester, “Hardware implementation of a nonlinear processor,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '99), vol. 6, pp. I-509–I-514, Orlando, Fla, USA, June 1999.
  10. J. A. Piñeiro and J. D. Bruguera, “High-speed double-precision computation of reciprocal, division, square root, and inverse square root,” IEEE Transactions on Computers, vol. 51, no. 12, pp. 1377–1388, 2002. View at Publisher · View at Google Scholar · View at MathSciNet · View at Scopus
  11. N. Takagi, “A hardware algorithm for computing reciprocal square root,” in Proceedings of the 15th IEEE Symposium on Computer Arithmetic, pp. 94–100, Vail, Colo, USA, June 2001.
  12. T. Lang and E. Antelo, “Radix-4 reciprocal square-root and its combination with division and square root,” IEEE Transactions on Computers, vol. 52, no. 9, pp. 1100–1114, 2003. View at Publisher · View at Google Scholar · View at Scopus
  13. N. Takagi, “Generating a power of an operand by a table look-up and a multiplication,” in Proceedings of the 13th IEEE Symposium on Computer Arithmetic, pp. 126–131, Asilomar, Calif, USA, July 1997.
  14. N. Takagi, “Powering by a table look-up and a multiplication with operand modification,” IEEE Transactions on Computers, vol. 47, no. 11, pp. 1216–1222, 1998. View at Google Scholar · View at Scopus
  15. M. J. Schulte and K. E. Wires, “High-speed inverse square roots,” in Proceedings of the 14th IEEE Symposium on Computer Arithmetic, pp. 124–131, Adelaide, Australia, April 1999.
  16. M. D. Ercegovac, T. Lang, J. M. Muller, and A. Tisserand, “Reciprocation, square root, inverse square root, and some elementary functions using small multipliers,” IEEE Transactions on Computers, vol. 49, no. 7, pp. 628–637, 2000. View at Google Scholar · View at Scopus
  17. J. N. Coleman and E. I. Chester, “A 32-bit logarithmic arithmetic unit and its performance compared to floating-point,” in Proceedings of the 14th IEEE Symposium on Computer Arithmetic, pp. 142–151, Adelaide, Australia, April 1999.
  18. M. Haselman, M. Beauchamp, A. Wood, S. Hauck, K. Underwood, and K. S. Hemmert, “A comparison of floating point and logarithmic number systems for FPGAs,” in Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '05), pp. 181–190, Napa, Calif, USA, April 2005. View at Publisher · View at Google Scholar
  19. C. H. Chen and C. Y. Lee, “Cost effective lighting processor for 3D graphics application,” in Proceedings of the International Conference on Image Processing (ICIP '99), vol. 2, pp. 792–796, Kobe, Japan, October 1999.
  20. A. Happonen, E. Hemming, and M. Juntti, “A novel coarse grain reconfigurable processing element architecture,” in Proceedings of the IEEE International Midwest Symposium on Circuits and Systems, vol. 3, pp. 827–830, Cairo, Egypt, December 2003.
  21. J. E. Stine and M. J. Schulte, “Symmetric table addition method for accurate function approximation,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 21, no. 2, pp. 167–177, 1999. View at Publisher · View at Google Scholar · View at Scopus
  22. F. de Dinechin and A. Tisserand, “Some improvements on multipartite table methods,” in Proceedings of the 15th IEEE Symposium on Computer Arithmetic, pp. 128–135, Vail, Colo, USA, June 2001.
  23. K. Rounioja and J. A. Parviainen, “Arithmetic processing unit for reciprocal operations,” in Proceedings of the International Symposium on System-on-Chip (SoC '03), pp. 109–112, Tampere, Finland, November 2003.
  24. G. H. Golub, Matrix Computations, John Hopkins University Press, Baltimore, Md, USA, 1989.
  25. S. Y. Kung , VLSI Array Processors, Prentice-Hall, Upper Saddle River, NJ, USA, 1987.
  26. R. Andraka, “Survey of CORDIC algorithms for FPGA based computers,” in Proceedings of the ACM/SIGDA 6th International Symposium on Field Programmable Gate Arrays (FPGA '98), pp. 191–200, Monterey, Calif, USA,, February 1998.
  27. H. Corporaal, Microprocessor Architectures from VLIW to TTA, John Wiley & Sons, New York, NY, USA, 1998.
  28. P. Salmela, A. Burian, H. Sorokin, and J. Takala, “Complex-valued QR decomposition implementation for MIMO receivers,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), pp. 1433–1436, Las Vegas, Nev, USA, April 2008. View at Publisher · View at Google Scholar
  29. P. Salmela, A. Happonen, T. Järvinen, A. Burian, and J. Takala, “DSP implementation of Cholesky decomposition,” in Proceedings of the Joint 1st Workshop on Sensor Networks and Symposium on Trends in Communications, pp. 6–9, Bratislava, Slovakia, June 2006.