Techniques for Performance Improvement of Integer Multiplication in Cryptographic Applications
The performance of arithmetic operations in number fields is actively researched, as evidenced by the significant number of publications in this field. In this work, we offer techniques that increase the performance of software implementations of the finite field multiplication algorithm on both 32-bit and 64-bit platforms. The developed technique, called the “delayed carry mechanism,” removes the need to account for a carry out of the most significant bit at each iteration of the sum accumulation loop. This mechanism reduces the total number of additions and makes it possible to apply modern parallelization technologies effectively.
Public key cryptographic transformations have evolved from the proposal of Diffie and Hellman to modern cryptosystems based on algebraic curves. However, the underlying transformations have remained the same: operations in a number field. Integer multiplication holds a special place among number field operations; see Figure 1. One of the pressing problems in improving public key cryptosystems is increasing the performance of their software and hardware implementations. One approach to increasing cryptosystem performance is to speed up multiplication in finite field arithmetic.
The speedup of arithmetic operations in number fields is actively researched, as evidenced by significant publications in this area [2–11]. Besides the algorithms for arithmetic operations, it is also worth studying the architecture of software libraries with field operations [12–21], which can reduce the overall overhead of field operations.
An analysis of publications [2–10] identified the most effective multiplication algorithms: Comba [2, 3] and Karatsuba [3, 8, 10]. Of the two, the Comba algorithm shows better results in benchmarks of software implementations on modern platforms [3–9]. The Karatsuba-Comba multiplication (KCM) algorithm for RISC processors has also been described in the literature. KCM is an interesting combination of the Comba and Karatsuba algorithms, in which the Karatsuba algorithm is used specifically for machine word multiplication. The main goal of this paper is therefore to propose a way to increase the performance of software implementations of finite field multiplication (and squaring) based on the well-known Comba algorithm [2, 3, 8]. This research was motivated by the need to confirm that software implementations of known algorithms remain effective as modern 32-bit and 64-bit platforms continue to develop. It is worth noting that the last ten years have seen considerable development of multicore CPUs and multi-CPU systems [8, 9].
2. Multiplication Algorithm-Prototype Description and Its Modification
Let us begin by introducing some notation and basic definitions. A carry is a digit that is transferred from one column of digits to the next, more significant column during a calculation. Two parameters describe the representation of a large integer: the machine word size and the number of machine words required to store the integer. We represent large integers (the operands of the multiplication) as arrays of machine words; see Figure 2. For example, a 65-bit integer needs three 32-bit machine words to store it.
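The word-count rule in the example above can be sketched in a few lines of C++. This is only an illustration: the helper names `words_needed` and `to_words` and the least-significant-word-first layout are assumptions of this sketch, not notation from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of machine words needed to store an integer of `bits` bits.
// Hypothetical helper; the default word size of 32 bits matches the text.
inline std::size_t words_needed(std::size_t bits, std::size_t word_bits = 32) {
    assert(word_bits > 0);
    return (bits + word_bits - 1) / word_bits;  // ceiling division
}

// Split a 64-bit value into 32-bit words, least significant word first
// (the word order is an assumption of this sketch).
inline std::vector<std::uint32_t> to_words(std::uint64_t value) {
    return { static_cast<std::uint32_t>(value & 0xFFFFFFFFu),
             static_cast<std::uint32_t>(value >> 32) };
}
```

With these helpers, `words_needed(65)` gives 3, in agreement with the 65-bit example, while a 64-bit integer fits in exactly 2 words.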
The Comba algorithm is built on two main loops (steps 2 and 3) with nested loops (steps 2.1 and 3.1); see Algorithm 1. At the lowest level of this hierarchy, inside loops 2.1 and 3.1, a 64-bit product of two 32-bit words, one from each operand, is computed.
The sum is accumulated in three 32-bit temporary variables on each iteration, in steps 2.1.2 and 2.1.3.
The result word is assigned from these temporary variables, which are then updated at each iteration in step 2.2.
The main drawbacks of Comba’s algorithm are as follows.
(i) In the nested loops 2.1 and 3.1, the sum is accumulated with carry in three 32-bit temporary variables (steps 2.1.2 and 2.1.3, and likewise steps 3.1.2 and 3.1.3). Each pass costs 3 additions of 32-bit integers (2 of them with carry) and 3 assignments of 32-bit variables, and this accumulation with carry takes place in every iteration of loop 2.1.
(ii) In the nested loops 2.1 and 3.1, the carries between the three 32-bit variables are handled by assembler code that implements addition with carry. Such instructions cannot be paired or parallelized, so processor resources are used inefficiently.
(iii) Loops 2 and 3 cannot be parallelized effectively, because carry handling couples their iterations tightly.
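The carry handling criticized above can be made concrete with a minimal sketch of product-scanning (Comba) multiplication using a three-word accumulator. It is not the paper's Algorithm 1: the names `comba_mul`, `r0`, `r1`, `r2`, `u`, and `v` are assumptions, the two main loops are merged into one loop with computed index bounds, and the add-with-carry instructions are emulated with portable comparisons instead of assembler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Word = std::uint32_t;

// Classic Comba multiplication of two n-word integers (least significant
// word first). The carry is detected and propagated on EVERY inner iteration.
std::vector<Word> comba_mul(const std::vector<Word>& a, const std::vector<Word>& b) {
    assert(!a.empty() && a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<Word> c(2 * n, 0);
    Word r0 = 0, r1 = 0, r2 = 0;  // three-word column accumulator
    for (std::size_t k = 0; k + 1 < 2 * n; ++k) {
        const std::size_t i_min = (k < n) ? 0 : k - n + 1;
        const std::size_t i_max = (k < n) ? k : n - 1;
        for (std::size_t i = i_min; i <= i_max; ++i) {
            const std::uint64_t uv = static_cast<std::uint64_t>(a[i]) * b[k - i];
            const Word v = static_cast<Word>(uv);        // low half of the product
            const Word u = static_cast<Word>(uv >> 32);  // high half of the product
            r0 += v;
            Word carry = (r0 < v) ? 1u : 0u;   // carry out of r0
            r1 += carry;
            carry = (r1 < carry) ? 1u : 0u;    // carry caused by adding the carry
            r1 += u;
            carry += (r1 < u) ? 1u : 0u;       // carry caused by adding u
            r2 += carry;
        }
        c[k] = r0;  // emit one result word, then shift the accumulator
        r0 = r1;
        r1 = r2;
        r2 = 0;
    }
    c[2 * n - 1] = r0;
    return c;
}
```

Every inner iteration contains a chain of dependent carry checks, which is the coupling described in drawbacks (i) to (iii).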
The computational complexity of Comba’s algorithm is obtained by counting three kinds of operations: assignments of 32-bit integers, additions of 32-bit integers, and multiplications of 32-bit integers.
Figure 2 illustrates the drawbacks of the algorithm for three-word operands and their impact on its computational complexity.
Modern CPUs allow the use of 64-bit data types and operations to achieve better performance, but the algorithm is not adapted for their use.
In the upper part of the figure, there are two large operands, each represented by three 32-bit words of machine-word size. The iterations of the algorithm are shown below the horizontal line. Note that Comba’s algorithm implements the well-known long multiplication technique with one small difference: a word of one operand is multiplied only by those words of the other operand that fall into the same column, that is, those for which the two word indices sum to the column index.
Unlike long multiplication, this approach adds the intermediate products not by rows but by columns, so that each column yields one part of the resulting product (below the line). Each multiplication is accompanied by sum accumulation, as shown in Figure 3.
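As a worked illustration of the column-wise scheme for three-word operands, the column sums can be written out explicitly. The symbols $a_i$, $b_j$ (operand words, least significant first) and $s_k$ (sum of column $k$) are assumed here for illustration and are not necessarily the paper's notation.

\[
\begin{aligned}
s_0 &= a_0 b_0,\\
s_1 &= a_0 b_1 + a_1 b_0,\\
s_2 &= a_0 b_2 + a_1 b_1 + a_2 b_0,\\
s_3 &= a_1 b_2 + a_2 b_1,\\
s_4 &= a_2 b_2,
\end{aligned}
\qquad
a \cdot b = \sum_{k=0}^{4} 2^{32k}\, s_k .
\]

The nine word products are the same as in long multiplication; only the order of summation changes. Each $s_k$ can exceed 64 bits, which is why the column sum has to be accumulated with carry handling.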
The computational complexity for will be
The drawbacks are eliminated as follows.
(i) Modern 32-bit CPUs efficiently add 32-bit and 64-bit integers using 32-bit instructions. This makes it possible to accumulate carries by adding 32-bit variables into a 64-bit accumulator variable, which removes the need to track and correct the carry after each addition to the temporary variables. The accumulated carry is accounted for in the final steps of loops 2 and 3.
(ii) Modern CPUs have a multicore architecture that allows them to execute several instruction streams at the same time. This makes it possible to execute the iterations of loops 2 and 3 in parallel using OpenMP [22–24].
The following notation is introduced in Algorithm 2.
(i) Separate symbols denote 64-bit variables and 32-bit variables.
(ii) One operation extracts the 32 most significant bits of a 64-bit variable, and another extracts the 32 least significant bits of a 64-bit variable.
The computational complexity of the modified Comba algorithm is obtained by counting assignments of 32-bit integers, assignments of 64-bit integers, additions of 32-bit integers, additions of 32-bit integers to 64-bit integers, and multiplications of 32-bit integers.
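One possible reading of the delayed carry mechanism is sketched below. It is not a transcription of Algorithm 2: the names `comba_mul_delayed`, `Hi`, `Lo`, `lo_sum`, and `hi_sum` are assumptions, and splitting the work into two phases is a design choice of this sketch. The low and high halves of each 64-bit product are added as 32-bit values into 64-bit accumulators, so no carry is examined inside the inner loop; the accumulated carries are resolved once per column afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Word = std::uint32_t;
using Acc  = std::uint64_t;

inline Word Lo(Acc x) { return static_cast<Word>(x); }        // 32 least significant bits
inline Word Hi(Acc x) { return static_cast<Word>(x >> 32); }  // 32 most significant bits

// Comba multiplication with delayed carry (least significant word first).
std::vector<Word> comba_mul_delayed(const std::vector<Word>& a, const std::vector<Word>& b) {
    assert(!a.empty() && a.size() == b.size());
    const std::size_t n = a.size();
    const std::size_t cols = 2 * n - 1;
    std::vector<Acc> lo_sum(cols, 0), hi_sum(cols, 0);

    // Phase 1: column sums without any carry handling. Each column writes
    // only its own two accumulators, so the iterations are independent.
    // With OpenMP enabled, a "#pragma omp parallel for" could be tried here.
    for (std::size_t k = 0; k < cols; ++k) {
        const std::size_t i_min = (k < n) ? 0 : k - n + 1;
        const std::size_t i_max = (k < n) ? k : n - 1;
        Acc lo = 0, hi = 0;
        for (std::size_t i = i_min; i <= i_max; ++i) {
            const Acc p = static_cast<Acc>(a[i]) * b[k - i];
            lo += Lo(p);  // 32-bit value added into a 64-bit accumulator
            hi += Hi(p);
        }
        lo_sum[k] = lo;
        hi_sum[k] = hi;
    }

    // Phase 2: resolve the delayed carries, one pass over the columns.
    std::vector<Word> c(2 * n, 0);
    Acc carry = 0;
    for (std::size_t k = 0; k < cols; ++k) {
        const Acc t = lo_sum[k] + carry;
        c[k] = Lo(t);
        carry = Hi(t) + hi_sum[k];  // carried into the next column
    }
    c[2 * n - 1] = Lo(carry);
    return c;
}
```

Each accumulator receives at most n values below 2^32, so it cannot overflow for any practical operand length. Whether running phase 1 in parallel pays off depends on operand size and thread overhead, which this sketch does not measure.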
3. Comparison with Other Algorithms
To provide an objective comparison of the results, the authors reviewed well-known software math libraries for public key cryptography [12–21]. According to the reviews [25, 26], the GMP software library was taken as the reference. GMP uses Karatsuba’s integer multiplication algorithm. The software implementations are compared by the average execution time of the Comba algorithm, the modified Comba algorithm, and the multiplication implemented in the GMP library, measured over one million iterations.
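A measurement of this kind could be organized as in the following sketch. The paper does not describe its timing code, so the helper `average_ns` and the use of `std::chrono::steady_clock` are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Average wall-clock time of one call to `op`, in nanoseconds,
// measured over `iterations` consecutive calls. Hypothetical helper.
template <typename F>
double average_ns(F&& op, std::size_t iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        op();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

A call such as `average_ns(multiply_once, 1000000)`, where `multiply_once` is a hypothetical callable wrapping one multiplication, would correspond to the one million iterations mentioned above. The result of each multiplication should be consumed in some way, or an optimizing compiler may remove the measured work.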
To measure the algorithm performance of software implementation we can use protocols in fields of Table 1 from , except . These fields are recommended [3, 27] for use in cryptographic applications at different security levels. Table 1 gives a brief definition of the fields and their prime moduli.
The proposed modified Comba algorithm and its prototype, the Comba algorithm, were implemented in C++ and compiled with Microsoft Visual Studio 2010 in the Release Win32 configuration with the Maximize Speed option and SSE2 instruction support.
The reference library GMP v4.1.2 was compiled with Microsoft Visual Studio .NET, and the test application was compiled with Microsoft Visual Studio 2010 in the Release Win32 configuration with the Maximize Speed option and SSE2 instruction support.
Testing used a mainstream mobile platform with an Intel Core i3 350M CPU and a desktop platform with an Intel Pentium Dual-Core E5400 CPU.
Timings measured for the different algorithms, implementations, and CPUs are shown in Table 2.
As the timings in Table 2 show, the proposed modification of the Comba algorithm is 1.5 times faster than GMP. The classic implementation of Comba’s algorithm is the slowest, which agrees with the theoretical estimate, since it performs more additions and assignments. In addition, the proposed software implementations of the multiplication algorithms are more efficient on the Pentium Dual-Core CPU, which has a higher clock frequency, than on the Core i3 CPU, which has several instruction streams. These implementations do not use parallelization; thus, the more powerful multicore Core i3 CPU, with 4 instruction processing streams, does not realize its full potential.
The research resulted in the following conclusions.
(1) The performance of the software implementation of Comba’s integer multiplication algorithm is increased by a factor of 1.5–2 and exceeds that of the popular math library GMP v4.1.2 by a factor of 1.5 on average.
(2) The modified Comba multiplication algorithm is preferable to Karatsuba’s algorithm as used in the GMP library, because the implementation of the modified Comba algorithm is faster than the Karatsuba implementation in GMP on modern hardware platforms (32- and 64-bit).
(3) The delayed carry mechanism makes it possible to apply different parallelization techniques to the modified Comba algorithm, for example, OpenMP, Intel Threading Building Blocks, NVIDIA CUDA, and OpenCL.
Recent microprocessor development keeps increasing the number of instruction processing streams. This calls for the development of algorithms suitable for efficient parallelization.
NVIDIA already offers GPUs with more than 256 cores and the corresponding CUDA toolkit, which makes it possible to create multithreaded applications. This area is already being closely studied, as demonstrated in publications [9, 31]. Our further research will focus on the investigation and effective parallelization of algorithms for arithmetic operations on integers.
Conflict of Interests
The authors declare that they have no conflict of interests.
M. Brown, D. Hankerson, J. Lopez, and A. Menezes, “Software implementation of the NIST elliptic curves over prime fields,” Research Report CORR 2000-55, Department of Combinatorics and Optimization, University of Waterloo, Waterloo, Canada, 2000.
C. Paar, “Implementation options for finite field arithmetic for elliptic curve cryptosystems,” in Proceedings of the Elliptic Curve Cryptography (ECC '99), Worcester Polytechnic Institute, 1999.
G. Gaubatz, Versatile Montgomery multiplier architectures [M.S. thesis], Electrical and Computer Engineering, Worcester Polytechnic Institute, 2002.
J. Großschädl, R. M. Avanzi, E. Savaş, and S. Tillich, “Energy-efficient software implementation of long integer modular arithmetic,” in Proceedings of the 7th International Conference on Cryptographic Hardware and Embedded Systems (CHES '05), pp. 75–90, Springer, 2005.
P. Giorgi, T. Izard, and A. Tisserand, “Comparison of modular arithmetic algorithms on GPUs,” in Proceedings of the International Conference on Parallel Computing (ParCo '09), Lyon, France, 2009.
The GNU Multiple Precision Arithmetic Library (GMP), http://gmplib.org.
Multiprecision Unsigned Number Template Library (MUNTL), http://mktmk.narod.ru/eng/muntl/muntl.htm.
Galois Field Arithmetic Library, http://www.partow.net/projects/galois/.
Multiprecision Integer and Rational Arithmetic C/C++ Library (MIRACL), http://indigo.ie/~mscott.
“Intel 64 and IA-32 Architectures Optimization Reference Manual,” Order Number: 248966-025, http://www.cs.princeton.edu/courses/archive/fall13/cos217/reading/ia32opt.pdf.
The OpenMP API Specification for Parallel Programming, http://openmp.org/wp/openmp-specifications/.
OpenMP in Visual C++, http://msdn.microsoft.com/en-us/library/tt15eb9t.aspx.
A. Abusharekh and K. Gaj, “Comparative analysis of software libraries for public key cryptography,” in Proceedings of the Software Performance Enhancement for Encryption and Decryption (SPEED '2007), June 2007.
P. Giorgi, L. Imbert, and T. Izard, “Multipartite modular multiplication,” http://hal.archives-ouvertes.fr/lirmm-00618437/fr/.
National Institute of Standards and Technology, “Recommended elliptic curves for federal government use,” Appendix to FIPS 186-2, 2000.
Intel Threading Building Blocks, http://software.intel.com/en-us/articles/intel-tbb.
NVIDIA, “NVIDIA CUDA Programming Guide 2.0,” http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf.
T. Güneysu and C. Paar, “Ultra high performance ECC over NIST primes on commercial FPGAs,” in Cryptographic Hardware and Embedded Systems – CHES 2008, E. Oswald and P. Rohatgi, Eds., vol. 5154 of Lecture Notes in Computer Science, pp. 62–78, Springer, Berlin, Germany, 2008.