#### Abstract

The growing market for GNSS-capable mobile devices is driving interest in GNSS software solutions, as they can share many system resources (processor, memory), reducing both the size and the cost of their integration. Indeed, with the increasing performance of modern processors, it is now feasible to implement in software a multichannel GNSS receiver operating in real time. However, a major issue with this approach is the large amount of computing resources required for the base-band processing, in particular for the correlation operations. Therefore, new algorithms need to be developed to reduce the overall complexity of the receiver architecture. To that end, this paper first introduces the challenges of the software implementation of a GPS receiver, with a focus on the base-band processing and correlation operations. It then describes the existing solutions and, building on them, introduces a new algorithm based on distributed arithmetic.

#### 1. Introduction

With the increasing performance of modern processors, it is now feasible to implement a real-time multichannel GNSS receiver in software (i.e., where all the basic base-band operations such as the correlation are performed on a general-purpose microprocessor). However, a major issue with the software approach is the large amount of computing resources required for the base-band processing. To illustrate this issue, let us consider a conventional base-band GPS architecture as shown in Figure 1.

In this architecture, the incoming satellite signal is sequentially processed at the system sampling rate for (i) residual carrier removal, (ii) PRN code removal, and (iii) integration and dumping.
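The three stages can be sketched in plain integer arithmetic as follows. This is only an illustrative sketch: the function name `correlate_integer` and the sample values are hypothetical, and the carrier and code replicas are assumed to be already generated and sampled at the system rate.

```python
def correlate_integer(samples, carrier_i, carrier_q, code):
    """Integrate-and-dump correlation using one integer operation per step.

    All four inputs are equally long sequences of small integers
    (hypothetical values; replica generation is not shown).
    """
    acc_i = 0
    acc_q = 0
    for s, ci, cq, c in zip(samples, carrier_i, carrier_q, code):
        # Residual carrier removal: two multiplications per sample.
        mixed_i = s * ci
        mixed_q = s * cq
        # PRN code removal: two more multiplications per sample.
        wiped_i = mixed_i * c
        wiped_q = mixed_q * c
        # Integration: two additions per sample; the "dump" is the return.
        acc_i += wiped_i
        acc_q += wiped_q
    return acc_i, acc_q


# Two samples, purely for illustration.
acc_i, acc_q = correlate_integer([1, -3], [2, 1], [1, 2], [1, -1])
```

Every sample costs four multiplications and two additions per correlator output pair in this sketch, which is why the per-second operation count grows so quickly with the sampling rate and the number of channels.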

Table 1 provides a rough estimate of the number of integer additions and multiplications per second that are necessary, in addition to the unavoidable load and store operations, to process satellites with the architecture of Figure 1 (without considering carrier and code generation).

From Table 1, a 12-channel receiver operating at MHz requires approximately additions and multiplications to be executed each second. Consequently, as several earlier studies have concluded (see, e.g., [1]), a straightforward transposition of standard hardware-based architectures into software leads to a number of real-time operations that can hardly be managed even by today’s fastest computers. New algorithms or architectures therefore have to be developed to minimize the computational load of the base-band processing, in particular of the correlation operations. Two main strategies have been proposed in the literature to this end. The first relies on Single Instruction Multiple Data (SIMD) operations, while the second exploits the bitwise representation of the incoming signal. Both approaches are discussed hereafter.

#### 2. Single Instruction Multiple Data (SIMD) Operations

In 1997, Intel introduced its first SIMD extension to the x86 instruction set under the name MMX. SIMD instructions operate on vectors of data; the MMX instructions perform integer arithmetic on eight 8-bit, four 16-bit, or two 32-bit integers packed into a 64-bit MMX register. Unlike standard instructions, sometimes referred to as Single Instruction Single Data (SISD), the data are manipulated in blocks and several values can be loaded simultaneously, as illustrated in Figure 2.
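The idea of operating on several packed integers at once can be emulated in portable code. The sketch below does not use real MMX instructions: it mimics a wrap-around packed addition of eight 8-bit lanes inside one 64-bit integer with the classic mask-and-XOR trick. The helper names `pack_u8`, `unpack_u8`, and `packed_add_u8` are hypothetical.

```python
HIGH_BITS = 0x8080808080808080  # top bit of each 8-bit lane
LOW_BITS = 0x7F7F7F7F7F7F7F7F   # lower 7 bits of each lane


def pack_u8(values):
    """Pack eight integers in 0..255 into one 64-bit word (lane 0 lowest)."""
    return int.from_bytes(bytes(values), "little")


def unpack_u8(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return list(word.to_bytes(8, "little"))


def packed_add_u8(a, b):
    """Add eight 8-bit lanes at once, each lane wrapping modulo 256.

    The lower 7 bits are added normally (no carry can cross a lane),
    then the top bit of each lane is fixed up with an exclusive OR.
    """
    return ((a & LOW_BITS) + (b & LOW_BITS)) ^ ((a ^ b) & HIGH_BITS)


a = pack_u8([1, 2, 3, 4, 250, 6, 7, 255])
b = pack_u8([10, 20, 30, 40, 10, 60, 70, 1])
lanes = unpack_u8(packed_add_u8(a, b))
```

A hardware SIMD unit performs the same lane-wise addition in a single instruction, without the masking overhead needed here.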

On average, SIMD operations require more clock cycles than traditional operations. However, since they operate on multiple integer values at the same time, SIMD operations can result in a significant gain in execution speed, especially for repetitive and parallel tasks such as base-band processing. Similarly, Digital Signal Processors (DSPs) can also offer great code optimization possibilities, as some of them are capable of performing several multibit multiplications in parallel. However, both SIMD operations and DSPs are tied to very specific hardware implementations, which severely limits the portability of the code. In order to maximize the flexibility of the receiver, solutions of this kind are not considered further in this document.

#### 3. Bitwise Processing

Contrary to SIMD operations, bitwise processing (sometimes also referred to in the literature as vector processing) uses a universal CPU instruction set and exploits the native bit representation of the signal. The data bits are stored in separate vectors (generally one sign word and one or several magnitude words) on which bitwise parallel operations can be performed independently. The objective is to take advantage of the high parallelism and speed of the bitwise operations, whereby a single integer operation is translated into a few simple parallel logical relations.

Consequently, while the integer arithmetic interprets the data horizontally as a single word, the bitwise processing manipulates the bits separately in a vertical way, as illustrated in Figure 3.
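A minimal sketch of this vertical layout is given below. It assumes 2-bit samples in sign and magnitude form with the values ±1 and ±3, a sign bit of 1 for negative values, and a code chip of -1 encoded as bit 1; all of these conventions and the helper names are assumptions made for illustration.

```python
def pack_planes(samples):
    """Store samples vertically: bit n of each word belongs to sample n."""
    sign_word = 0
    magnitude_word = 0
    for n, s in enumerate(samples):
        if s < 0:
            sign_word |= 1 << n       # assumed convention: 1 means negative
        if abs(s) == 3:
            magnitude_word |= 1 << n  # assumed convention: 1 means magnitude 3
    return sign_word, magnitude_word


def unpack_planes(sign_word, magnitude_word, count):
    """Convert the two bit planes back into a list of integers."""
    samples = []
    for n in range(count):
        magnitude = 3 if (magnitude_word >> n) & 1 else 1
        negative = (sign_word >> n) & 1
        samples.append(-magnitude if negative else magnitude)
    return samples


def pack_code(chips):
    """Encode chips of +1/-1 as one word, with -1 stored as bit 1."""
    word = 0
    for n, c in enumerate(chips):
        if c < 0:
            word |= 1 << n
    return word


samples = [1, -3, 3, -1, -1, 3, 1, -3]
chips = [1, 1, -1, -1, 1, -1, -1, 1]
sign_word, magnitude_word = pack_planes(samples)
# One exclusive OR flips the sign of every sample whose chip is -1.
wiped_sign = sign_word ^ pack_code(chips)
wiped = unpack_planes(wiped_sign, magnitude_word, len(samples))
```

The single exclusive OR replaces eight integer multiplications here, and the same instruction would cover as many samples as the register holds.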

Many software receivers (see, e.g., [3, 4]) exploit bitwise processing. Depending on the configuration, the code and carrier mixing can be carried out by a few basic logical relations in parallel on several samples, making the architecture particularly efficient. However, as the data bits are vertically spread over several sign and magnitude words, a reconversion into the integer representation is ultimately required to perform the accumulation and the data readout. This is done at the cost of numerous bitwise operations that needlessly increase the complexity. In conclusion, the inherent drawback of bitwise processing is its lack of flexibility: the complexity depends on the bit depth and may increase drastically with the data quantization.

#### 4. Distributed Arithmetic

The original concept of distributed arithmetic was developed to optimize the implementation of digital filters in Field Programmable Gate Arrays (FPGAs). The main idea is to rearrange the multiplications and additions of a sum of products at the bit level to take advantage of small tables of precomputed sums (see, e.g., [5]). However, this concept can also be adapted to a GNSS software receiver design in order to optimize the accumulations involved in the correlation process, as explained hereafter.

In Figure 1, the accumulation consists in summing the consecutive samples $i_E(n)$, $i_P(n)$, and $i_L(n)$ and $q_E(n)$, $q_P(n)$, and $q_L(n)$ over the integration period as:

$$I_X = \sum_{n=0}^{N-1} i_X(n), \qquad Q_X = \sum_{n=0}^{N-1} q_X(n), \qquad X \in \{E, P, L\}, \tag{1}$$

with $N$ being the number of samples per integration.

For notational simplicity in this paper, we only provide equations for the in-phase ($I$) components (the same operations apply to the quadrature components). We also omit the index $X$ of $i_X$ and $I_X$.

Let us express the signal $i(n)$ as a linear combination of its $M$ quantization bits. For the sake of simplicity, we consider here the two’s complement notation, which decomposes the signal as follows:

$$i(n) = -2^{M-1}\, b_{M-1}(n) + \sum_{m=0}^{M-2} 2^{m}\, b_m(n), \tag{2}$$

with $b_m(n)$ being the *m*th bit of the signal at the sampling instant $n$.

We define the partial sum $S_m$ associated with the *m*th data bit of the signal as:

$$S_m = \sum_{n=0}^{N-1} b_m(n). \tag{3}$$

By inserting (2) into (1) and rearranging the terms with (3), we obtain:

$$I = -2^{M-1}\, S_{M-1} + \sum_{m=0}^{M-2} 2^{m}\, S_m. \tag{4}$$

The accumulation $I$ is now expressed as a linear combination of the partial sums $S_m$. The challenge now consists in efficiently computing (3) for the $M$ bits of $i(n)$. Since $S_m$ is the arithmetic sum of all the bits contained in the word $b_m$, it can be obtained by simply counting the number of bits equal to the logical value 1. Although some modern processors now offer an embedded instruction to perform this operation (which also limits the portability of the code), the most straightforward solution is to implement a Look-Up Table (LUT) that is directly addressed by the word $b_m$ itself and that outputs the corresponding partial sum $S_m$.

Thanks to the above distributed arithmetic implementation, the conversion from the bitwise representation into the integer one is performed in parallel to the accumulation. Furthermore, the architecture can easily accommodate various signal configurations, as the complexity stays almost proportional to the bit-depth of the incoming signal.
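The accumulation described in this section can be sketched as follows, assuming two's complement samples, 8-sample words, and a 256-entry LUT addressed directly by each word. The names `bit_plane_words` and `accumulate` and the word size are hypothetical choices for illustration, not the paper's implementation.

```python
WORD_BITS = 8  # assumed register width for this sketch
# LUT addressed by the word itself; it returns the number of bits set to 1.
LUT = [bin(word).count("1") for word in range(1 << WORD_BITS)]


def bit_plane_words(samples, bits):
    """Return words[m][k]: word k of bit plane m, 8 samples per word."""
    mask = (1 << bits) - 1
    word_count = (len(samples) + WORD_BITS - 1) // WORD_BITS
    words = [[0] * word_count for _ in range(bits)]
    for n, s in enumerate(samples):
        encoded = s & mask  # two's complement pattern of the sample
        for m in range(bits):
            if (encoded >> m) & 1:
                words[m][n // WORD_BITS] |= 1 << (n % WORD_BITS)
    return words


def accumulate(samples, bits):
    """Sum the samples through per-bit partial sums instead of per-sample adds."""
    words = bit_plane_words(samples, bits)
    # One table lookup per word gives the partial sum of each bit plane.
    partial = [sum(LUT[w] for w in plane) for plane in words]
    # Two's complement weights: the top bit counts negatively.
    total = -(1 << (bits - 1)) * partial[bits - 1]
    for m in range(bits - 1):
        total += (1 << m) * partial[m]
    return total


samples = [3, -4, 1, -1, 0, 2, -3, -2, 1, 1]
result = accumulate(samples, bits=3)
```

In a real receiver the bit planes would come straight out of the bitwise mixing stage, so the packing loop in `bit_plane_words` exists only to make the example self-contained.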

#### 5. A GPS Implementation Example

Unlike the traditional bitwise approach, distributed arithmetic requires the signal to be accumulated to be expressed as a linear combination of its data bits (cf. (2)). This introduces an additional constraint on the former bitwise processing for carrier and code removal. To illustrate this aspect, let us consider the example of an incoming GPS satellite signal digitized with 2 bits per sample that are associated with the integer values ±1 and ±3.

In the receiver of Figure 1, the incoming signal is first mixed with a complex carrier quantized with 2 bits and associated with the integer values ±1 and ±2. The mixing of the signal and the carrier results in signals that can take one of the integer values shown in Table 2.
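The set of values produced by this mixing can be enumerated directly. The sketch below assumes that the four signal levels are ±1 and ±3 and the four carrier levels are ±1 and ±2, i.e., a 2-bit sign and magnitude reading of the values quoted above; the variable names are hypothetical.

```python
signal_levels = (-3, -1, 1, 3)    # assumed 2-bit signal quantization
carrier_levels = (-2, -1, 1, 2)   # assumed 2-bit carrier quantization

# Every value the mixer output can take for one sample.
products = sorted({s * c for s in signal_levels for c in carrier_levels})

# In sign and magnitude form, the sign of the product is the exclusive OR
# of the two input sign bits (1 meaning negative in this sketch).
sign_rule_holds = all(
    ((s * c) < 0) == ((s < 0) ^ (c < 0))
    for s in signal_levels
    for c in carrier_levels
)
```

Under these assumptions the mixer output takes eight distinct values, four magnitudes with either sign, which is what the binary representation chosen next has to cover.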

An appropriate binary representation of the mixed signal must be selected in order to minimize the number of bitwise operations necessary to perform the carrier and code mixing operations. In our example, bits are needed to represent the values of Table 2 if we use the following (heuristic) bit decomposition:

Using (5) and the sign and magnitude representation of the signal and the carrier, Table 2 can be encoded as shown in Table 3.

We can now translate the truth table of Table 3 into the following logical equations that can be carried out with 7 operations or even 6 by storing one intermediate result:

Following the carrier mixing operations (cf. Figure 1), the code mixing simply consists in a sign inversion or noninversion, which can be translated into an exclusive OR between the code and all the respective signal bits:

With the above signal representation, the whole carrier and code mixing can be realized with 10 basic instructions that operate in parallel on 8, 16, 32, 64, or 128 bits, depending on the CPU register size.

From each of the words obtained with (7), the partial sum is calculated and summed up with the previous ones in order to form the final accumulation expressed as (using (4)):

As explained in the previous section, in order to save some operations, the partial sums are computed by means of a LUT. The table must fit into the microprocessor cache to allow fast execution, but it must also be large enough to minimize the number of memory accesses.
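The trade-off between table size and number of memory accesses can be illustrated with a word that is wider than the table index. In this sketch a 32-bit word is split into chunks, each of which addresses the LUT once; the chunk widths of 8 and 16 bits and the names `build_lut` and `partial_sum` are assumptions chosen for illustration.

```python
def build_lut(chunk_bits):
    """Table of bit counts for every possible chunk value."""
    return [bin(value).count("1") for value in range(1 << chunk_bits)]


def partial_sum(word, word_bits, lut, chunk_bits):
    """Count the bits set in `word` using one table lookup per chunk."""
    mask = (1 << chunk_bits) - 1
    total = 0
    for shift in range(0, word_bits, chunk_bits):
        total += lut[(word >> shift) & mask]
    return total


lut8 = build_lut(8)     # 256 entries, four lookups per 32-bit word
lut16 = build_lut(16)   # 65536 entries, two lookups per 32-bit word

word = 0xF0F0A501
sum_with_lut8 = partial_sum(word, 32, lut8, 8)
sum_with_lut16 = partial_sum(word, 32, lut16, 16)
```

Both tables give the same partial sum; the larger one halves the number of lookups at the price of a table 256 times bigger, which is exactly the cache-fit constraint mentioned above.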

#### 6. Performance Comparison

The efficiency of the proposed distributed arithmetic depends on the register size *R* of the host computer (i.e., *R* bits can be processed in parallel). For the previous 2-bit data example, the total number of operations is shown in Table 4.

With respect to the integer base-band processing of Figure 1, the main improvement lies in the near absence of integer multiplications, which are advantageously replaced by parallel logical operations. In this way, assuming a 16-bit CPU and a 2-bit data quantization, the number of integer additions needed to perform the correlation is reduced by almost 40%.

In comparison to conventional bitwise processing, the decomposition (2) may require more logical operations for the carrier and code mixing. On the other hand, no additional stage is needed to convert from the bitwise into the integer representation. Thus, in the case of a 2-bit data configuration as proposed in [6], the distributed arithmetic lowers the complexity by almost a factor of two. It becomes even more efficient for higher signal bit depths, as the complexity grows almost proportionally with the data bit quantization (while it increases exponentially in a standard implementation such as that in [6]).

#### 7. Conclusion

As the software implementation of standard base-band architectures is not really suitable for real-time operation, new approaches must be developed. While bitwise processing represents a very interesting and popular alternative, taking advantage of the parallelism and universality of the basic logical CPU instructions, it still suffers from a lack of flexibility and scalability compared with integer operations. Distributed arithmetic, on the other hand, bridges the bitwise and integer representations, combining the efficiency of the former with the flexibility of the latter. The concept is simple and elegant. This paper has shown that for a standard configuration (2-bit signal and carrier quantization), the number of arithmetic operations is reduced by nearly a factor of two with respect to both integer arithmetic and bitwise processing implementations. Finally, while our example only considers GNSS, the approach is also applicable to other types of receivers, such as DS-CDMA ones.