Abstract
Improving the performance of arithmetic operations in number fields is an active research topic, as the substantial literature in this area shows. In this work, we propose techniques that increase the performance of software implementations of the finite field multiplication algorithm on both 32-bit and 64-bit platforms. The proposed technique, called the “delayed carry mechanism,” removes the need to handle the carry out of the most significant bit at each iteration of the sum accumulation loop. This mechanism reduces the total number of additions and makes it possible to apply modern parallelization technologies effectively.
1. Introduction
Public key cryptographic transformations have evolved from the original proposal of Diffie and Hellman to modern cryptosystems on algebraic curves [1]. However, the underlying transformations have remained the same: they rely on operations in a number field. Integer multiplication holds a special place among number field operations; see Figure 1. One of the pressing problems in improving public key cryptosystems is increasing the performance of their software and hardware implementations. One approach to increasing cryptosystem performance is to speed up multiplication in finite field arithmetic.
The speedup of arithmetic operations in number fields is actively researched, as evidenced by significant publications in this area [2–11]. Besides the algorithms for arithmetic operations, it is also worth studying approaches to the architecture of software libraries [12–21] for field operations, which can reduce the overall overhead of field operations.
The analysis of publications [2–10] identified the most effective multiplication algorithms, Comba [2, 3] and Karatsuba [3, 8, 10]. Of the two, the Comba algorithm shows better results in performance tests (benchmarks) of software implementations on modern platforms [3–9]. The Karatsuba-Comba multiplication (KCM) algorithm for RISC processors is described in [8]. The KCM algorithm is an interesting combination of the Comba and Karatsuba algorithms, in which the Karatsuba algorithm is used specifically for machine word multiplication. The main goal of this paper is therefore to propose a way to increase the performance of software implementations of finite field multiplication (and squaring) based on the well-known Comba algorithm [2, 3, 8]. This research was motivated by the need to confirm the efficiency of software implementations of known algorithms as modern 32-bit and 64-bit platforms continue to develop. It is important to mention that the last ten years have seen much development in the direction of multicore CPUs and multi-CPU systems [8, 9].
2. Multiplication Algorithm Prototype Description and Its Modification
Let us begin by introducing some notation and basic definitions. A carry is a digit that is transferred from one column of digits to another column of more significant digits during a calculation. We also use the machine word size in bits and the number of machine words required to store a large integer. We represent large integers (multipliers) as a sequence of machine words; see Figure 2. For example, a 65-bit integer requires three 32-bit machine words.
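The word count in the example above follows from a ceiling division. A minimal sketch, in which the helper name `words_needed` and its parameter names are illustrative assumptions rather than notation from the paper:

```cpp
#include <cstddef>

// Number of machine words needed to store an integer of `bits` bits
// when each machine word holds `word_bits` bits (ceiling division).
// The function and parameter names are hypothetical, chosen for this sketch.
constexpr std::size_t words_needed(std::size_t bits, std::size_t word_bits) {
    return (bits + word_bits - 1) / word_bits;
}
```

For instance, `words_needed(65, 32)` evaluates to 3, which matches the 65-bit example.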
The Comba algorithm [2] is based on the main loops p. 2 and p. 3 and the nested loops p. 2.1 and p. 3.1 (Algorithm 1). At the lowest level of the hierarchy, in loops p. 2.1 and p. 3.1, we compute a 64-bit integer product, which is represented by two 32-bit integers.

The sum is accumulated in three 32-bit temporary variables at each iteration, in p. 2.1.2 and p. 2.1.3.
The final result is assigned from these temporary variables, which change at each iteration in p. 2.2.
The main drawbacks of Comba’s algorithm are as follows.
(i) In nested loops p. 2.1 and p. 3.1 the sum is accumulated with carry in three 32-bit temporary variables (p. 2.1.2, p. 2.1.3 and p. 3.1.2, p. 3.1.3). This takes 3 additions of 32-bit integers (2 of them with carry) and 3 assignments of 32-bit variables. The sum accumulation with carry takes place in each iteration of loop p. 2.1.
(ii) In nested loops p. 2.1 and p. 3.1, the carries between the three 32-bit accumulation variables are handled with assembler code that implements addition with carry. This prevents instruction pairing and parallelization [22]; therefore processor resources are used ineffectively.
(iii) Loops p. 2 and p. 3 cannot be effectively parallelized, because carry handling creates tight dependencies within the code.
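The per-iteration carry handling criticized above can be sketched in C++ as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 1: the names `comba_mul`, `c0`, `c1`, `c2`, `u`, and `v` are hypothetical, the operands are little-endian vectors of 32-bit words of equal length, and the two main loops p. 2 and p. 3 are merged into a single column loop for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Product-scanning (Comba) multiplication with a three-word accumulator
// (c2:c1:c0) and explicit carry handling in every inner iteration.
// All identifiers are assumptions made for this sketch.
std::vector<std::uint32_t> comba_mul(const std::vector<std::uint32_t>& a,
                                     const std::vector<std::uint32_t>& b) {
    const std::size_t n = a.size();  // assumes b.size() == n
    std::vector<std::uint32_t> r(2 * n, 0);
    if (n == 0) return r;

    std::uint32_t c0 = 0, c1 = 0, c2 = 0;
    for (std::size_t k = 0; k + 1 < 2 * n; ++k) {
        // Column k collects the products a[i] * b[k - i].
        const std::size_t i_lo = (k < n) ? 0 : k - n + 1;
        const std::size_t i_hi = (k < n) ? k : n - 1;
        for (std::size_t i = i_lo; i <= i_hi; ++i) {
            const std::uint64_t uv = static_cast<std::uint64_t>(a[i]) * b[k - i];
            const std::uint32_t v = static_cast<std::uint32_t>(uv);        // low half
            const std::uint32_t u = static_cast<std::uint32_t>(uv >> 32);  // high half

            // c0 += v, remembering the carry out of 32 bits.
            c0 += v;
            const std::uint32_t carry = (c0 < v) ? 1u : 0u;

            // c1 += u + carry, again tracking the carry out of 32 bits.
            c1 += u;
            std::uint32_t carry2 = (c1 < u) ? 1u : 0u;
            c1 += carry;
            carry2 += (c1 < carry) ? 1u : 0u;

            // The top word absorbs whatever overflowed from c1.
            c2 += carry2;
        }
        r[k] = c0;  // emit one result word per column
        c0 = c1;    // shift the accumulator down by one word
        c1 = c2;
        c2 = 0;
    }
    r[2 * n - 1] = c0;
    return r;
}
```

Every inner iteration depends on the carries produced by the previous one, which is the dependency chain that drawbacks (ii) and (iii) refer to.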
It is easy to obtain the computational complexity of Comba’s algorithm in terms of the number of assignment operations on 32-bit integers, addition operations on 32-bit integers, and multiplication operations on 32-bit integers.
Figure 2 illustrates the drawbacks of the algorithm for three-word operands and their impact on the computational complexity of the algorithm.
Modern CPUs allow the use of 64-bit data types and operations to achieve better performance, but the algorithm is not adapted for their use.
In the upper part of the figure, the two large operands are each represented by three 32-bit integers of machine word size. The algorithm iterations are shown below the horizontal line. It should be noted that Comba’s algorithm implements the well-known long multiplication technique with one small difference: each word of one multiplier is multiplied only by those words of the other multiplier whose indices satisfy the column condition, so the partial products are grouped by columns.
Unlike long multiplication, this approach adds the intermediate products by columns rather than by rows. Each column yields one part of the resulting product (below the line). Each multiplication is accompanied by sum accumulation, as shown in Figure 3.
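For three-word operands the column grouping can be written out explicitly. In the sketch below, the word names $a_i$ and $b_j$, the column sums $s_k$, and the radix $2^{32}$ for 32-bit words are notation assumed for this illustration:

```latex
\begin{aligned}
s_0 &= a_0 b_0, \\
s_1 &= a_0 b_1 + a_1 b_0, \\
s_2 &= a_0 b_2 + a_1 b_1 + a_2 b_0, \\
s_3 &= a_1 b_2 + a_2 b_1, \\
s_4 &= a_2 b_2, \\
a \cdot b &= \sum_{k=0}^{4} s_k \, 2^{32k}.
\end{aligned}
```

Each $s_k$ gathers the products $a_i b_j$ with $i + j = k$; the words of the result are obtained from the $s_k$ once the carries between columns have been propagated.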
The computational complexity for will be
The following steps eliminate these drawbacks.
(i) Modern 32-bit CPUs efficiently implement addition of 32-bit and 64-bit integers using 32-bit CPU instructions. This makes it possible to accumulate carries by adding 32-bit variables into a 64-bit accumulator variable, which removes the need for carry accounting and correction after each addition to the temporary variables. The accumulated carry is accounted for in the final iterations of the loops in p. 2 and p. 3.
(ii) Modern CPUs have a multicore architecture that allows them to execute several instruction streams at the same time. This property makes it possible to execute the iterations of loops p. 2 and p. 3 in parallel using the OpenMP library [22–24].
The following notation is introduced in Algorithm 2.
(i) Separate symbols are used to denote 64-bit variables and 32-bit variables.
(ii) One operation extracts the 32 most significant bits of a 64-bit variable, and another operation extracts the 32 least significant bits of a 64-bit variable.
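The delayed carry idea and the two extraction operations can be sketched in C++ as follows. This is a minimal illustration under stated assumptions and should not be read as Algorithm 2 itself: the names `hi32`, `lo32`, `comba_mul_delayed`, `acc_lo`, and `acc_hi` are hypothetical, the operands are little-endian vectors of 32-bit words of equal length, and the word count is assumed to be below 2^30 so that the 64-bit accumulators cannot overflow.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Extract the 32 most / least significant bits of a 64-bit value.
// The helper names are assumptions made for this sketch.
inline std::uint32_t hi32(std::uint64_t x) { return static_cast<std::uint32_t>(x >> 32); }
inline std::uint32_t lo32(std::uint64_t x) { return static_cast<std::uint32_t>(x); }

// Column-wise multiplication in which carries are accumulated in 64-bit
// variables and resolved once per column instead of once per product.
std::vector<std::uint32_t> comba_mul_delayed(const std::vector<std::uint32_t>& a,
                                             const std::vector<std::uint32_t>& b) {
    const std::size_t n = a.size();  // assumes b.size() == n and n < 2^30
    std::vector<std::uint32_t> r(2 * n, 0);
    if (n == 0) return r;

    std::uint64_t acc_lo = 0;  // low halves plus the carry from the previous column
    std::uint64_t acc_hi = 0;  // high halves of the current column
    for (std::size_t k = 0; k + 1 < 2 * n; ++k) {
        const std::size_t i_lo = (k < n) ? 0 : k - n + 1;
        const std::size_t i_hi = (k < n) ? k : n - 1;
        for (std::size_t i = i_lo; i <= i_hi; ++i) {
            const std::uint64_t uv = static_cast<std::uint64_t>(a[i]) * b[k - i];
            acc_lo += lo32(uv);  // plain 64-bit additions, no carry check here
            acc_hi += hi32(uv);
        }
        r[k] = lo32(acc_lo);             // one result word per column
        acc_lo = hi32(acc_lo) + acc_hi;  // delayed carry moves to the next column
        acc_hi = 0;
    }
    r[2 * n - 1] = lo32(acc_lo);
    return r;
}
```

The inner loop now contains only two independent additions per product, and the carry is handled once at the end of each column.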

It is not difficult to obtain the computational complexity of the modified Comba’s algorithm in terms of the number of assignment operations on 32-bit integers, assignment operations on 64-bit integers, addition operations on 32-bit integers, addition operations on 32-bit and 64-bit integers, and multiplications of 32-bit integers.
Figures 4 and 5 illustrate Algorithm 2 for ; computational complexity in this case will be
3. Comparison with Other Algorithms
In order to provide an objective comparison of the results, we reviewed well-known mathematical software libraries [12–21] for public key cryptography. According to the reviews [25, 26], the GMP library [12] was chosen as the reference. GMP uses Karatsuba’s integer multiplication algorithm [12]. The software implementations are compared by the average execution time of the Comba algorithm, the modified Comba’s algorithm, and the multiplication implemented in the GMP library, measured over one million iterations.
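A measurement of this kind can be sketched with the standard C++ clock facilities. The helper below is an assumption made for illustration and is not the test application used for the reported timings; the name `average_ns` and its parameters are hypothetical.

```cpp
#include <chrono>
#include <cstddef>

// Average wall-clock time per call, in nanoseconds, over `iterations` calls.
// Hypothetical helper, not the measurement code behind the published results.
template <typename F>
double average_ns(F&& f, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        f();  // operation under test, e.g. one multiplication
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return iterations ? elapsed.count() / static_cast<double>(iterations) : 0.0;
}
```

A call such as `average_ns(op, 1000000)` would correspond to the one million iterations mentioned above, where `op` stands for the multiplication being timed. In a real benchmark the result of each multiplication has to be consumed in some way so that the compiler does not optimize the call away.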
To measure the performance of the software implementations we use the fields listed in Table 1, which are taken from [27] with one field excluded. These fields are recommended [3, 27] for use in cryptographic applications at different security levels. Table 1 gives a brief definition of the fields and their prime moduli.
The proposed modified Comba’s algorithm and its prototype, the Comba algorithm, were implemented in C++ and compiled with Microsoft Visual Studio 2010 in the Release Win32 configuration with the Maximize Speed option and SSE2 instruction support.
The reference library GMP v4.1.2 was compiled with Microsoft Visual Studio .NET, and the test application was compiled with Microsoft Visual Studio 2010 in the Release Win32 configuration with the Maximize Speed option and SSE2 instruction support.
Testing was performed on a mainstream mobile platform with an Intel Core i3 350M CPU and on a desktop platform with an Intel Pentium Dual Core E5400 CPU.
The measured timings for the different algorithms, implementations, and CPUs are shown in Table 2.
As the timings in Table 2 show, the proposed modification of the Comba algorithm is about 1.5 times faster than GMP. The classic implementation of Comba’s algorithm is the slowest, which is confirmed by the theoretical estimate (it contains a larger number of addition and assignment operations). In addition, the proposed software implementation of the multiplication algorithms is more efficient on the Pentium Dual Core CPU, which has a higher clock frequency, than on the Core i3 CPU, which supports several instruction streams. This implementation of the multiplication algorithms does not support parallelization; thus, the more powerful multicore Core i3 CPU with 4 instruction streams cannot realize its full potential.
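Since the measured implementation is single-threaded, the following is only a sketch of how the delayed carry could allow the column loop to be split across threads with OpenMP; it was not part of the measurements above. The names `comba_mul_columns`, `sum_lo`, and `sum_hi` are hypothetical, the operands are little-endian vectors of 32-bit words of equal length, and the word count is assumed to be below 2^30. The pragma takes effect only when the compiler has OpenMP enabled; otherwise the code runs sequentially.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Two-phase column-wise multiplication: independent column sums first,
// then one sequential pass that resolves all delayed carries.
// All identifiers are assumptions made for this sketch.
std::vector<std::uint32_t> comba_mul_columns(const std::vector<std::uint32_t>& a,
                                             const std::vector<std::uint32_t>& b) {
    const long long n = static_cast<long long>(a.size());  // assumes b.size() == n, n < 2^30
    std::vector<std::uint32_t> r(static_cast<std::size_t>(2 * n), 0);
    if (n == 0) return r;

    const long long cols = 2 * n - 1;
    std::vector<std::uint64_t> sum_lo(static_cast<std::size_t>(cols));
    std::vector<std::uint64_t> sum_hi(static_cast<std::size_t>(cols));

    // Phase 1: columns share no state, so iterations may run in parallel.
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long long k = 0; k < cols; ++k) {
        const long long i_lo = (k < n) ? 0 : k - n + 1;
        const long long i_hi = (k < n) ? k : n - 1;
        std::uint64_t lo = 0, hi = 0;
        for (long long i = i_lo; i <= i_hi; ++i) {
            const std::uint64_t uv =
                static_cast<std::uint64_t>(a[static_cast<std::size_t>(i)]) *
                b[static_cast<std::size_t>(k - i)];
            lo += static_cast<std::uint32_t>(uv);  // low halves
            hi += uv >> 32;                        // high halves
        }
        sum_lo[static_cast<std::size_t>(k)] = lo;
        sum_hi[static_cast<std::size_t>(k)] = hi;
    }

    // Phase 2: a short sequential pass propagates the delayed carries.
    std::uint64_t carry = 0;
    for (long long k = 0; k < cols; ++k) {
        const std::uint64_t t = sum_lo[static_cast<std::size_t>(k)] + carry;
        r[static_cast<std::size_t>(k)] = static_cast<std::uint32_t>(t);
        carry = (t >> 32) + sum_hi[static_cast<std::size_t>(k)];
    }
    r[static_cast<std::size_t>(cols)] = static_cast<std::uint32_t>(carry);
    return r;
}
```

The design choice here is to trade extra memory for the per-column sums against the removal of every dependency between columns in the first phase; whether this pays off for operand sizes typical of the fields in Table 1 would have to be measured.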
4. Conclusions
The research resulted in the following conclusions.
(1) We increased the performance of the software implementation of Comba’s integer multiplication algorithm by a factor of 1.5–2 and surpassed the performance of the popular math library GMP v4.1.2 by a factor of 1.5 on average.
(2) The modified Comba’s multiplication algorithm is preferable to Karatsuba’s algorithm [2], which is used in the GMP library, because the implementation of the modified Comba’s algorithm is faster than the Karatsuba [2] implementation in GMP on modern hardware platforms (32-bit and 64-bit).
(3) The delayed carry mechanism allows applying different parallelization techniques to the modified Comba’s algorithm, for example, OpenMP [23], Intel Threading Building Blocks [28], NVIDIA CUDA [29], and OpenCL [30].
Recent microprocessor development has increased the number of instruction streams that can be processed simultaneously. This makes it necessary to develop algorithms that are suitable for efficient parallelization.
NVIDIA already offers GPUs with more than 256 cores and the accompanying CUDA toolkit [29], which allows creating multithreaded applications. This area is already under close study, as publications [9, 31] demonstrate. Our further research will focus on investigating and effectively parallelizing algorithms for arithmetic operations on integers.
Conflict of Interests
The authors declare that they have no conflict of interests.