Research Article  Open Access
Distributed Arithmetic for Efficient Base-Band Processing in Real-Time GNSS Software Receivers
Abstract
The growing market of GNSS-capable mobile devices is driving interest in GNSS software solutions, as they can share many system resources (processor, memory), reducing both the size and the cost of their integration. Indeed, with the increasing performance of modern processors, it is now feasible to implement in software a multichannel GNSS receiver operating in real time. However, a major issue with this approach is the large computing resources required for the baseband processing, in particular for the correlation operations. Therefore, new algorithms need to be developed in order to reduce the overall complexity of the receiver architecture. Towards that aim, this paper first introduces the challenges of the software implementation of a GPS receiver, with a main focus on the baseband processing and correlation operations. It then describes the existing solutions and, from these, introduces a new algorithm based on distributed arithmetic.
1. Introduction
With the increasing performance of modern processors, it is now feasible to implement a real-time multichannel GNSS receiver in software (i.e., where all the basic baseband operations such as the correlation are performed on a general-purpose microprocessor). However, a major issue with the software approach is the large computing resources required for the baseband processing. To illustrate this issue, let us consider a conventional baseband GPS architecture as shown in Figure 1.
In this architecture, the incoming satellite signal is sequentially processed at the system sampling rate for (1) residual carrier removal, (2) PRN code removal, and (3) integration and dumping.
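As a point of reference, the three stages above can be sketched for a single correlator arm in plain Python. This is a minimal illustration rather than actual receiver code: the function name `correlate_integer`, the pre-generated carrier and code replicas, and the sample values are assumptions chosen for the example.

```python
def correlate_integer(samples, carrier_i, carrier_q, code):
    """Integer correlation of one arm over one integration period.

    All arguments are equal-length sequences of small integers. The carrier
    and code replicas are assumed to have been generated beforehand.
    """
    acc_i = 0
    acc_q = 0
    for s, ci, cq, c in zip(samples, carrier_i, carrier_q, code):
        # (1) residual carrier removal: one multiplication per component
        mixed_i = s * ci
        mixed_q = s * cq
        # (2) PRN code removal: one more multiplication per component
        # (3) integration: one addition per component
        acc_i += mixed_i * c
        acc_q += mixed_q * c
    # "dump": the two sums are handed over at the end of the period
    return acc_i, acc_q


# Hypothetical 2-bit samples (+/-1, +/-3), 2-bit carrier (+/-1, +/-2), code +/-1
samples = [1, -3, 3, -1]
carrier_i = [1, 2, -1, -2]
carrier_q = [2, -1, -2, 1]
code = [1, -1, -1, 1]
print(correlate_integer(samples, carrier_i, carrier_q, code))  # (12, 4)
```

In this loop every sample costs four integer multiplications and two additions, and that cost is repeated for every additional code replica and every channel.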
Table 1 provides a rough estimate of the number of integer additions and multiplications per second necessary to process a given number of satellites with the architecture of Figure 1. The estimate does not include the unavoidable load and store operations, nor the carrier and code generation.

From Table 1, a 12-channel receiver operating at a sampling frequency in the MHz range requires a very large number of additions and multiplications to be executed each second. Consequently, as several former studies have concluded (see, e.g., [1]), a straightforward transposition of standard hardware-based architectures into software leads to a number of real-time operations that can hardly be managed even by today's fastest computers. New algorithms or architectures therefore have to be developed in order to minimize the computational load of the baseband processing, in particular of the correlation operations. Two main strategies have been proposed in the literature. The first one relies on the use of Single Instruction Multiple Data (SIMD) operations, while the second consists in exploiting the bitwise representation of the incoming signal. Both approaches are discussed hereafter.
2. Single Instruction Multiple Data (SIMD) Operations
In 1997, Intel introduced its first SIMD instruction set for x86 processors under the name MMX. SIMD instructions operate on vectors of data; MMX instructions perform integer arithmetic on eight 8-bit, four 16-bit, or two 32-bit integers packed into a 64-bit MMX register. Unlike standard instructions, sometimes referred to as Single Instruction Single Data (SISD), the data are manipulated in blocks and several values can be loaded simultaneously, as illustrated in Figure 2.
On average, SIMD operations require more clock cycles than traditional operations. However, since they operate on multiple integer values at the same time, they can result in a significant gain in execution speed, especially for repetitive and parallel tasks such as baseband processing. Similarly, Digital Signal Processors (DSPs) offer great code optimization possibilities, as some of them can perform several multibit multiplications in parallel. However, both SIMD operations and DSPs are tied to very specific hardware implementations, which severely limits the portability of the code. In order to maximize the flexibility of the receiver, this kind of solution is not considered further in this document.
3. Bitwise Processing
Contrary to the SIMD operations, bitwise processing (sometimes also referred to in the literature as vector processing) uses a universal CPU instruction set and exploits the native bit representation of the signal. The data bits are stored in separate vectors (generally one sign word and one or several magnitude words) on which bitwise parallel operations can be performed independently. The objective is to take advantage of the high parallelism and speed of the bitwise operations, for which a single integer operation is translated into a few simple parallel logical relations.
Consequently, while the integer arithmetic interprets the data horizontally as a single word, the bitwise processing manipulates the bits separately in a vertical way, as illustrated in Figure 3.
Many software receivers (see, e.g., [3, 4]) exploit bitwise processing. Depending on the configuration, the code and carrier mixing can be carried out by a few basic logical relations in parallel on several samples, making the architecture particularly efficient. However, as the data bits are vertically spread over several sign and magnitude words, a reconversion into the integer representation is finally required to perform the accumulation and the data readout. This is done at the cost of numerous bitwise operations that needlessly increase the complexity. In conclusion, the inherent drawback of bitwise processing is its lack of flexibility, as the complexity becomes bit-depth dependent and may increase drastically with the data quantization.
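The vertical layout and its main drawback can be illustrated with a short Python sketch. The helper names, the bit conventions (sign bit 1 for a negative sample, magnitude bit 1 for a magnitude of 3, code bit 1 for a chip of -1), and the sample values are assumptions made for this example and are not taken from the cited receivers. The reconversion shown here is deliberately naive.

```python
def pack_sign_magnitude(samples):
    """Pack 2-bit samples (+/-1, +/-3) into one sign word and one magnitude word."""
    sign = 0
    mag = 0
    for n, s in enumerate(samples):
        if s < 0:
            sign |= 1 << n      # sign bit 1 means a negative sample
        if abs(s) == 3:
            mag |= 1 << n       # magnitude bit 1 means |s| = 3
    return sign, mag


def pack_code(chips):
    """Pack +/-1 code chips into one word (bit 1 means a chip of -1)."""
    word = 0
    for n, c in enumerate(chips):
        if c < 0:
            word |= 1 << n
    return word


def unpack_and_sum(sign, mag, count):
    """Convert back to integers sample by sample, then accumulate."""
    total = 0
    for n in range(count):
        value = 3 if (mag >> n) & 1 else 1
        total += -value if (sign >> n) & 1 else value
    return total


samples = [1, -3, 3, -1, 3, 3, -1, 1]
chips = [1, -1, -1, 1, 1, -1, 1, -1]
sign, mag = pack_sign_magnitude(samples)
mixed_sign = sign ^ pack_code(chips)    # code mixing of all 8 samples in one XOR
print(unpack_and_sum(mixed_sign, mag, len(samples)))   # -2
print(sum(s * c for s, c in zip(samples, chips)))      # -2
```

The mixing is a single XOR on the sign word, whereas the reconversion walks through the samples one by one before they can be accumulated.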
4. Distributed Arithmetic
The original concept of distributed arithmetic was developed for optimizing the implementation of digital filters in Field Programmable Gate Arrays (FPGAs). The main idea is to rearrange the multiplies and adds of a sum of products at the bit level to take advantage of small tables of precomputed sums (see, e.g., [5]). However, this concept can also be adapted to a GNSS software receiver design in order to optimize the accumulations involved in the correlation process, as explained hereafter.
In Figure 1, the accumulation consists in summing up the consecutive samples $i_E(n)$, $i_P(n)$, and $i_L(n)$ and $q_E(n)$, $q_P(n)$, and $q_L(n)$ (early, prompt, and late correlator arms) over the integration period as:
$$I_X = \sum_{n=0}^{N-1} i_X(n), \qquad Q_X = \sum_{n=0}^{N-1} q_X(n), \qquad X \in \{E, P, L\}, \tag{1}$$
with $N$ being the number of samples per integration.
For notational simplicity, we only provide equations for the in-phase ($I$) components (the same operations apply to the quadrature components). We also omit the index $X$ of $i_X$ and $I_X$.
Let us express the signal $i(n)$ as a linear combination of its $M$ quantization bits. For the sake of simplicity, we consider here the two's complement notation, which decomposes the signal as follows:
$$i(n) = -2^{M-1}\, b_{M-1}(n) + \sum_{m=0}^{M-2} 2^{m}\, b_m(n), \tag{2}$$
with $b_m(n)$ being the mth bit of the signal at the sampling instant $n$.
We define the partial sum $P_m$ associated with the mth data bit of the signal as:
$$P_m = \sum_{n=0}^{N-1} b_m(n). \tag{3}$$
By substituting (2) into (1) and rearranging the terms using (3), we obtain:
$$I = -2^{M-1}\, P_{M-1} + \sum_{m=0}^{M-2} 2^{m}\, P_m. \tag{4}$$
The accumulation is now expressed as a linear combination of the partial sums $P_m$. The challenge now consists in efficiently computing (3) for the $M$ bits of $i(n)$. Since $P_m$ is the arithmetic sum of all the bits contained in the word that stores $b_m(n)$ for consecutive samples, it can be obtained by simply counting the number of bits equal to the logical value 1. Some modern processors provide a dedicated instruction for this operation, but relying on it limits the portability of the code. The most straightforward solution is to implement a Look-Up Table (LUT) that is directly addressed by the word itself and that outputs the corresponding partial sum $P_m$.
Thanks to the above distributed arithmetic implementation, the conversion from the bitwise representation into the integer one is performed in parallel with the accumulation. Furthermore, the architecture can easily accommodate various signal configurations, as the complexity stays almost proportional to the bit depth of the incoming signal.
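A minimal Python sketch of this idea follows, assuming two's complement samples and a 256-entry table addressed one byte at a time. The names `POPCOUNT_LUT`, `partial_sum`, `to_bit_planes`, and `accumulate` are hypothetical, and Python integers of arbitrary length stand in for CPU registers.

```python
# 256-entry table: POPCOUNT_LUT[w] is the number of bits equal to 1 in byte w.
POPCOUNT_LUT = [bin(w).count("1") for w in range(256)]


def partial_sum(word):
    """Sum of the bits of a non-negative word, one table lookup per byte."""
    total = 0
    while word:
        total += POPCOUNT_LUT[word & 0xFF]
        word >>= 8
    return total


def to_bit_planes(samples, bits):
    """Spread two's complement samples over `bits` words (plane m holds bit m)."""
    planes = [0] * bits
    for n, s in enumerate(samples):
        pattern = s & ((1 << bits) - 1)   # two's complement bit pattern of s
        for m in range(bits):
            if (pattern >> m) & 1:
                planes[m] |= 1 << n
    return planes


def accumulate(planes):
    """Weighted sum of the partial sums; the top bit carries a negative weight."""
    bits = len(planes)
    total = -(1 << (bits - 1)) * partial_sum(planes[-1])
    for m in range(bits - 1):
        total += (1 << m) * partial_sum(planes[m])
    return total


samples = [3, -4, 1, -1, 0, 2, -3, -2]          # 3-bit two's complement range
print(accumulate(to_bit_planes(samples, 3)))    # -4
print(sum(samples))                             # -4
```

The number of table lookups grows with the number of bit planes, not with the number of values a sample can take, which is consistent with the near-proportional scaling described above.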
5. A GPS Implementation Example
Unlike the traditional bitwise approach, the distributed arithmetic requires the signal to be accumulated to be expressed as a linear combination of its data bits (cf. (2)). This introduces an additional constraint on the former bitwise processing for carrier and code removal. To illustrate this aspect, let us consider the example of an incoming GPS satellite signal digitized with 2 bits per sample that are associated with the integer values ±1 and ±3.
In the receiver of Figure 1, the incoming signal is first mixed with a complex carrier quantized with 2 bits and associated with the integer values ±1 and ±2. The mixing of the signal and the carrier results in signals that can take one of the integer values shown in Table 2.

An appropriate binary representation of the mixed signal must be selected in order to minimize the number of bitwise operations necessary to perform the carrier and code mixing operations. In our example, the number of bits needed to represent the values of Table 2 is set by the following (heuristic) bit decomposition:
Using (5) and the sign and magnitude representation of the signal and the carrier, Table 2 can be encoded as shown in Table 3.

We can now translate the truth table of Table 3 into the following logical equations, which can be carried out with 7 operations, or even 6 by storing one intermediate result:
Following the carrier mixing operations (cf. Figure 1), the code mixing simply consists of a sign inversion or non-inversion, which can be translated into an exclusive OR between the code and each of the signal bits:
With the above signal representation, the whole carrier and code mixing can be realized with 10 basic instructions that operate in parallel on 8, 16, 32, 64, or 128 bits depending on the CPU register size.
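The exact logical equations depend on the chosen encoding. As an illustration only, the following Python sketch assumes signal levels ±1 and ±3, carrier levels ±1 and ±2, and one possible three-bit decomposition, value = 3·b0 + 4·b1 + 5·b2 − 6. This decomposition covers the products ±1, ±2, ±3, and ±6 and has the property that inverting all three bits negates the value. The decomposition, the input bit conventions, and the function names are assumptions made for this sketch and may differ from the encoding of Table 3; no attempt is made to minimize the number of logical operations.

```python
from itertools import product

WEIGHTS = (3, 4, 5)   # assumed weights of bits b0, b1, b2
OFFSET = -6           # assumed constant: value = 3*b0 + 4*b1 + 5*b2 - 6


def pack(values, is_set):
    """Build a word whose bit n is 1 when is_set(values[n]) is true."""
    word = 0
    for n, v in enumerate(values):
        if is_set(v):
            word |= 1 << n
    return word


def mix_carrier(s_sign, s_mag, c_sign, c_mag, width):
    """Carrier mixing of `width` samples at once, using logical operations only.

    Sign bit 1 = negative; s_mag bit 1 = |signal| is 3; c_mag bit 1 = |carrier| is 2.
    """
    mask = (1 << width) - 1
    neg = s_sign ^ c_sign            # 1 where the product is negative
    pos = neg ^ mask                 # 1 where the product is positive
    b0 = (s_mag & ~c_mag) ^ pos      # s_mag & ~c_mag flags a magnitude of 3
    b1 = (c_mag & ~s_mag) ^ pos      # c_mag & ~s_mag flags a magnitude of 2
    b2 = (s_mag | c_mag) ^ neg       # s_mag | c_mag is 0 only for a magnitude of 1
    return b0, b1, b2


def mix_code(planes, code_word):
    """Code mixing: a chip of -1 (code bit 1) inverts every bit, which negates."""
    return tuple(p ^ code_word for p in planes)


def decode(planes, n):
    """Integer value of sample n (used here for checking only)."""
    return OFFSET + sum(w * ((p >> n) & 1) for w, p in zip(WEIGHTS, planes))


# Exhaustive check over every (signal, carrier, chip) combination: 4 * 4 * 2 = 32
cases = list(product((1, -1, 3, -3), (1, -1, 2, -2), (1, -1)))
sig = [c[0] for c in cases]
car = [c[1] for c in cases]
chip = [c[2] for c in cases]
planes = mix_carrier(pack(sig, lambda v: v < 0), pack(sig, lambda v: abs(v) == 3),
                     pack(car, lambda v: v < 0), pack(car, lambda v: abs(v) == 2),
                     len(cases))
planes = mix_code(planes, pack(chip, lambda v: v < 0))
print(all(decode(planes, n) == sig[n] * car[n] * chip[n]
          for n in range(len(cases))))  # True
```

All 32 combinations are processed by the same handful of logical operations on a single word, which is the parallelism that the register size determines on a real CPU.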
From each of the words obtained with (7), the partial sum is calculated and summed up with the previous ones in order to form the final accumulation expressed as (using (4)):
As explained in the previous section, in order to save some operations, the partial sums are computed by means of a LUT. The table must fit into the microprocessor cache to allow fast execution but must also be large enough to minimize the number of memory accesses.
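The accumulation step and the role of the table size can be sketched as follows in Python. For illustration only, the sketch assumes a three-bit decomposition value = 3·b0 + 4·b1 + 5·b2 − 6 for the mixed samples, so that the accumulation over N samples is 3·P0 + 4·P1 + 5·P2 − 6·N. This decomposition, the function names, and the chunk widths are assumptions and may differ from (5) and (8).

```python
WEIGHTS = (3, 4, 5)   # assumed bit weights
OFFSET = -6           # assumed constant term per sample


def make_lut(chunk_bits):
    """Bit-count table with 2**chunk_bits entries, addressed by a chunk of a word."""
    lut = [0] * (1 << chunk_bits)
    for w in range(1, len(lut)):
        lut[w] = lut[w >> 1] + (w & 1)
    return lut


def partial_sum(word, lut, chunk_bits):
    """Number of bits set in `word`, using one table lookup per chunk."""
    mask = (1 << chunk_bits) - 1
    total = 0
    while word:
        total += lut[word & mask]
        word >>= chunk_bits
    return total


def encode(values):
    """Spread integer samples over three bit planes (for the demonstration only)."""
    table = {OFFSET + sum(w * ((k >> m) & 1) for m, w in enumerate(WEIGHTS)): k
             for k in range(8)}
    planes = [0, 0, 0]
    for n, v in enumerate(values):
        for m in range(3):
            if (table[v] >> m) & 1:
                planes[m] |= 1 << n
    return planes


def accumulate(planes, count, lut, chunk_bits):
    """Weighted sum of the partial sums plus the constant term."""
    total = OFFSET * count
    for weight, plane in zip(WEIGHTS, planes):
        total += weight * partial_sum(plane, lut, chunk_bits)
    return total


values = [1, -2, 3, 6, -6, -1, 2, -3, 6, 1, -3, 2]   # already mixed samples
planes = encode(values)
for chunk_bits in (4, 8, 16):
    lut = make_lut(chunk_bits)
    print(chunk_bits, len(lut), accumulate(planes, len(values), lut, chunk_bits))
print(sum(values))
```

Going from 8-bit to 16-bit chunks halves the number of lookups per word but grows the table from 256 to 65536 entries, which is the cache trade-off described above.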
6. Performances Comparison
The efficiency of the proposed distributed arithmetic depends on the register size R of the host computer (i.e., R bits can be processed in parallel). For the previous 2-bit data example, the total number of operations becomes as shown in Table 4.

With respect to the integer baseband processing of Figure 1, the best improvement lies in the near absence of integer multiplications, advantageously substituted by parallel logical operations. In this way, assuming a 16-bit CPU and a 2-bit data quantization, the number of integer additions needed to perform the correlation is reduced by almost 40%.
In comparison to the conventional bitwise processing, the decomposition (2) may require more logical operations for the carrier and code mixing. On the other hand, no additional stage is needed to convert from the bitwise into the integer representation. Thus, in the case of a 2-bit data configuration as proposed in [6], the distributed arithmetic lowers the complexity by almost a factor of two. It becomes even more efficient for higher signal bit depths, as the complexity grows almost proportionally with the data bit quantization (while it increases exponentially in a standard implementation such as in [6]).
7. Conclusion
As the software implementation of standard baseband architectures is not really suitable for real-time operation, new approaches must be developed. While bitwise processing represents a very interesting and popular alternative, taking advantage of the parallelism and universality of the basic logical CPU instructions, it still suffers from a lack of flexibility and scalability compared with integer operations. On the other hand, the distributed arithmetic bridges the bitwise and integer representations, combining the efficiency of the former with the flexibility of the latter. The concept is simple and elegant. This paper has shown that, for a standard configuration (2-bit signal and carrier quantization), the number of arithmetic operations is reduced by nearly a factor of two with respect to both integer arithmetic and bitwise processing implementations. Finally, while our example only considers GNSS, the approach is also applicable to other types of receivers such as DS-CDMA ones.
References
[1] G. W. Heckler and J. L. Garrison, Architecture of Reconfigurable Software Receiver, Purdue University, West Lafayette, Ind, USA, 2004.
[2] S. Charkhandeh et al., "Implementation and testing of a real-time software-based GPS receiver for x86 processors," in Proceedings of the ION National Technical Meeting (NTM '06), Monterey, Calif, USA, January 2006.
[3] B. M. Ledvina et al., "Real-time software receiver," US patent application no. US0227856 A1, October 2006.
[4] A. Fridman and S. Semenov, Architectures of Software GPS Receivers, Springer, Berlin, Germany, 2000.
[5] R. Andraka and A. Berkun, "FPGAs make a radar signal processor on a chip a reality," in Proceedings of the 33rd Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 559–563, Pacific Grove, Calif, USA, October 1999.
[6] B. M. Ledvina, M. L. Psiaki, S. P. Powell, and P. M. Kintner, "Bit-wise parallel algorithms for efficient software correlation applied to a GPS software receiver," IEEE Transactions on Wireless Communications, vol. 3, no. 5, pp. 1469–1473, 2004.
Copyright
Copyright © 2010 Grégoire Waelchli et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.