Abstract

Accelerating scalar multiplication has long been a central topic in elliptic curve cryptography, and many approaches have been proposed to this end. One interesting observation is that modern computers usually have multicore processors that can perform cryptographic computations in parallel. Inspired by this idea, we present a new parallel and efficient algorithm to speed up scalar multiplication. First, we introduce a new regular halve-and-add method that is very efficient thanks to projective coordinates. Then, we compare many different algorithms for double-and-add and halve-and-add. Finally, we combine the best double-and-add and halve-and-add methods to obtain a new, faster parallel algorithm that costs around less than the previous best. Furthermore, our algorithm is regular and contains no dummy operations, so it naturally provides protection against simple side-channel attacks.

1. Introduction

The elliptic curve was first imported into the world of cryptography by Neal Koblitz and Victor Miller independently in 1985 [1, 2] and is now increasingly used for a wide range of cryptographic primitives in practice, such as public-key encryption and digital signatures. More than 30 years after its introduction to the cryptography field, the practical advantages of the elliptic curve cryptosystem (ECC) are clear and well known: it has richer algebraic structure, a smaller key size, and relatively faster implementations at the same security level compared with other deployed schemes such as RSA. For these reasons, ECC is particularly suitable for resource-constrained devices.

The efficiency of ECC is dominated by the speed of scalar multiplication. Namely, given a rational point P of order N on an elliptic curve, one must compute kP = P + P + ... + P (k times) for a given scalar k. Obviously, scalar multiplication shares similar features with exponentiation in a general multiplicative finite group. Therefore, inspired by the repeated "square-and-multiply" algorithm, the commonly used binary method for scalar multiplication over elliptic curves, called "double-and-add", has become a fundamental technique.

In constrained environments, scalar multiplication is easily implemented by the "double-and-add" variant of Horner's rule, given the binary expansion of the scalar k. However, each bit of k implies a different algorithmic path during each iteration: if the bit is 0, only a point doubling is necessary, whereas if the bit is 1, a point doubling followed by a point addition is performed. As a consequence, the different power and time consumption of these two prominent building blocks can be detected by simple power analysis (SPA) [3] and timing attacks, so this naive implementation leaks information about the secret scalar k.

Protecting against simple side-channel attacks (SSCA) can be achieved by recoding scalars in a regular manner, meaning that scalar multiplication executes the same instructions in the same order for any input value. Coron introduced a countermeasure against SSCA named the "double-and-add always" algorithm [4]. By inserting a dummy operation when necessary, it evaluates scalar multiplication by executing one doubling and one addition in each loop iteration. However, it was soon found to be vulnerable to safe-error fault attacks [5, 6]: by inducing a fault at one iteration during the point addition, an adversary can determine whether that operation is a dummy by checking the correctness of the output.

A countermeasure against safe-error fault attacks performs scalar multiplication in a predictable pattern. Besides the most commonly used Montgomery-ladder algorithm [7], another efficient method is -ary recoding [8]. This algorithm recodes a scalar into a sequence of zeros and nonzero digits with a fixed proportion of nonzero digits. However, scanning a look-up table can be dangerous if this step cannot be performed in constant time.

Another area of increasing interest for regular scalar multiplication is the use of efficient curve forms that admit a complete addition law. For any pair of rational points on an elliptic curve (or in a desired subgroup), a complete addition law computes the correct result regardless of whether the two addends are identical. As a corollary of the main results in [9], elliptic curves embedded in a projective space by a symmetric line bundle admit a complete system of addition laws of bidegree (2, 2). The later work of Bosma and Lenstra [10] shows that, when suitably chosen, a single addition law can serve as the addition operation for all pairs of rational points. One well-studied example is Edwards curves [11, 12], whose exceptional pairs for the addition law lie outside the rational points. A recent work [13] proposed an optimized algorithm that adds any pair of rational points on prime-order elliptic curves defined over fields of characteristic different from 2 and 3.

In [14], the authors introduce a new approach to scalar multiplication called the Montgomery-halving algorithm, a variant of the original Montgomery-ladder point multiplication. They also present a new strategy for parallel implementation of point multiplication over elliptic curves: running the Montgomery-halving algorithm and the original Montgomery-ladder algorithm in parallel to compute a scalar multiplication concurrently. Moreover, this parallel algorithm protects against SSCA. However, in their scheme, affine coordinates have to be used for halving, because a projective form of the Montgomery-halving algorithm would not save any operations.

In this paper, we provide a similar parallel implementation based on a regular recoding technique, which is highly efficient because the doubling and halving scans run on two different coprocessors. Our work makes two main contributions.

The first contribution is a new regular algorithm for computing scalar multiplication with halvings, called zero-less signed-digit (ZSD) halve-and-add, which saves around and in cost compared with the Montgomery-halving method of [14] for m = 233 and m = 409. Projective coordinates are valuable here because they avoid field inversions, which is especially useful for our ZSD halve-and-add algorithm (Algorithm 1). For the halving operation, affine coordinates are the best choice; for the subsequent addition, projective coordinates are better. The Montgomery-halving algorithm in [14] is forced by its special structure to use affine coordinates throughout, whereas the different structure of our Algorithm 1 lets the base point stay in affine coordinates for halving and the accumulator stay in projective coordinates for addition, so a mixed projective addition law can be used and no extra coordinate transformations are needed. In addition, the regular recoding technique ensures a secure implementation of scalar multiplication against SSCA.

Input: of odd order , with
Output:
(1);
(2)For down to 1 do
(3)  
(4)  ;
(5)End for
(6)
(7)Return
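Since the symbols in the listing above did not survive, the following sketch models the regular halve-and-add structure just described. It uses the additive group of integers modulo an odd N as a stand-in for the curve subgroup, so "halving" is multiplication by the inverse of 2 modulo N; the function names `zsd_digits` and `zsd_halve_and_add` are ours, not the paper's.

```python
# Sketch of a ZSD halve-and-add in the additive group Z_N (N odd), where
# point halving is modeled as multiplication by 2^{-1} mod N.

def zsd_digits(kp, n):
    """n-digit zero-less signed-digit expansion of an odd integer kp < 2^n.

    Uses the rewrite (0,1) -> (1,-1): b_i = 2*bit_{i+1}(kp) - 1, b_{n-1} = 1.
    """
    assert kp & 1 and kp < (1 << n)
    return [2 * ((kp >> (i + 1)) & 1) - 1 for i in range(n - 1)] + [1]

def zsd_halve_and_add(k, P, N):
    """Compute k*P in Z_N with exactly one halving and one addition per digit."""
    n = N.bit_length()
    inv2 = pow(2, -1, N)                 # the "halving" map in this model
    kp = (k << (n - 1)) % N              # k' = 2^{n-1} * k mod N
    forced = (kp & 1) == 0
    kp |= 1                              # force the low bit so no digit is 0
    Q = 0
    for b in zsd_digits(kp, n):          # scan from the least significant digit
        Q = (Q * inv2 + b * P) % N       # Q <- Q/2 + b*P
    if forced:                           # undo the forced bit: subtract P/2^{n-1}
        Q = (Q - P * pow(inv2, n - 1, N)) % N
    return Q
```

Every iteration performs exactly one halving and one addition, so the operation sequence is independent of the scalar; on a curve, the accumulator would live in projective coordinates while the base point stays affine.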

The second contribution is the new mixed-parallel algorithm. After analyzing all the algorithms in Table 1, we combine the fastest double-and-add method, the Montgomery double-and-add method of [14], with the fastest halve-and-add method, our ZSD halve-and-add algorithm. This yields a new efficient and secure mixed-parallel algorithm, which costs around and less than the Montgomery-parallel approach of [14] for m = 233 and m = 409, respectively. A more thorough analysis is given in Section 4, and the related cost estimates are displayed in Tables 1 and 2.

The rest of this paper is organized as follows. In the next section, we introduce the relevant arithmetic of binary elliptic curves, in particular efficient coordinate systems for point representation, the twisted μ_4-normal form, and how to evaluate scalar multiplication in parallel by combining point halving and doubling. In Section 3, our new regular halve-and-add algorithm is presented; moreover, a parallel strategy similar to the one detailed in [14] shows how to implement scalar multiplication efficiently in a regular and parallel manner. Cost comparisons and expected performance are presented in Section 4. Finally, we conclude the paper.

2. Preliminaries

We focus on elliptic curves defined over binary fields F_{2^m} = F_2[x]/(f(x)), given by the Weierstrass equation

E: y^2 + xy = x^3 + ax^2 + b,

where a, b ∈ F_{2^m}, b ≠ 0, and f is an irreducible polynomial of degree m. Isomorphic to the divisor class group of degree 0, the rational points on E together with the point at infinity form an abelian group, whose basic group operation, addition, is given algebraically by the tangent-and-chord law.

Given two points P = (x1, y1) and Q = (x2, y2) on E, where P ≠ ±Q, if the addition of the two points is P + Q = (x3, y3), then the coordinates of P + Q can be computed according to the following formula:

x3 = λ^2 + λ + x1 + x2 + a,  y3 = λ(x1 + x3) + x3 + y1,

with λ = (y1 + y2)/(x1 + x2).

Similarly, given P = (x1, y1), where x1 ≠ 0, if the doubling of the point is 2P = (x3, y3), then the coordinates of 2P can be computed according to the following formula [15]:

x3 = λ^2 + λ + a,  y3 = x1^2 + (λ + 1)x3,

with λ = x1 + y1/x1.
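As a concrete check of these affine group-law formulas, the sketch below implements them over the small field F_{2^4} = F_2[x]/(x^4 + x + 1) and verifies by brute force that sums and doubles of curve points land back on the curve and that addition is commutative. The tiny field and the curve coefficients a = b = 1 are illustrative choices of ours, not parameters from the paper.

```python
M, IRR = 4, 0b10011            # F_{2^4} = F_2[x]/(x^4 + x + 1)
A, B = 1, 1                    # demo curve E: y^2 + xy = x^3 + A x^2 + B

def gf_mul(a, b):
    """Carry-less multiplication with on-the-fly reduction by IRR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= IRR
    return r

def gf_inv(a):                 # a^(2^M - 2) by square-and-multiply
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def on_curve(x, y):
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_mul(x, x), x) ^ gf_mul(A, gf_mul(x, x)) ^ B
    return lhs == rhs

def add(P, Q):                 # tangent-and-chord addition, P != +-Q
    (x1, y1), (x2, y2) = P, Q
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)

def dbl(P):                    # doubling, requires x1 != 0
    x1, y1 = P
    lam = x1 ^ gf_mul(y1, gf_inv(x1))
    x3 = gf_mul(lam, lam) ^ lam ^ A
    y3 = gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3)
    return (x3, y3)

points = [(x, y) for x in range(16) for y in range(16) if on_curve(x, y)]
```

Enumerating all affine points takes only 256 curve-equation checks at this field size, which is enough to sanity-check the formulas.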

From the above formulas, it is easy to notice that inversion operations in the base field are unavoidable, and they consume much time. Consequently, projective coordinate systems are usually preferred because they involve no field inversions. In practice, various coordinate systems are available. The work in this paper exploits two state-of-the-art coordinate systems, λ-coordinates and the projective coordinates of the twisted μ_4-normal form, which perform excellently in different situations.

2.1. λ-Coordinates

Efficient point representation is of great importance for accelerating scalar multiplication. Inversions in the base field take a large amount of time, yet they are indispensable if points are represented in affine coordinates. The homogeneous projective coordinate system (also called the standard projective coordinate system) is usually used to eliminate this obstacle by mapping any rational affine point (x, y) to one of its projective copies (X : Y : Z), where x = X/Z and y = Y/Z. When a projective copy (X : Y : Z) instead corresponds to the affine point (X/Z^2, Y/Z^3), where Z ≠ 0, it is the Jacobian projective coordinate system. Later, López and Dahab proposed a new and efficient projective coordinate system in which the difference is that (X : Y : Z) corresponds to (X/Z, Y/Z^2) [16]; it is denoted LD coordinates for short. Later still, Kim and Kim presented a four-dimensional LD coordinate system for binary curves which represents points as , with , , , and .

The λ-coordinate system was first noticed by Knudsen [17] when studying halving operations on binary elliptic curves, and Oliveira [18] further surveyed its arithmetic comprehensively. Given a point P = (x, y) with x ≠ 0, the λ-affine representation of P is defined as (x, λ), where λ = x + y/x. It is then easy to derive point addition and doubling formulas in λ-affine coordinates from the normal affine ones. Let P and Q be two points on E, where P ≠ ±Q; then the formula for P + Q can be given as follows:

For the doubling operation, 2P is given as follows:

As for the projective case, the translation between the λ-affine representation (x, λ) and the λ-projective representation (X : L : Z) is defined by x = X/Z and λ = L/Z, with Z ≠ 0. The negative of P = (X : L : Z) is −P = (X : L + Z : Z). Given two points P and Q represented in the λ-projective model on a binary elliptic curve, and similarly to the affine case, the addition arithmetic can be described by the following formulas:and the doubling 2P can be given as follows:

The associated group addition and doubling operations can be calculated with and field operations, respectively, where M denotes a field multiplication and S denotes a squaring.

Given the above formulas, a natural idea is to combine the doubling and addition formulas into a single formula evaluating 2Q + P, which is of great importance in the latter part of this paper.

Let P and Q be points of E; then 2Q + P can be computed as follows:

Using this, 2Q + P can be calculated efficiently with instead of field operations, where M denotes a field multiplication and S denotes a squaring [18].

2.2. Twisted μ_4-Normal Form

Twisted μ_4-normal form [19] can be seen as a complement and extension of μ_4-normal form [20]. The related definitions, theorems, equation forms, and group laws of the twisted μ_4-normal form and μ_4-normal form are given in Kohel's series of papers [19–22]. There are three variants of the (twisted) μ_4-normal form, called the (twisted) μ_4-normal form, the (twisted) semisplit μ_4-normal form, and the (twisted) split μ_4-normal form. For practical reasons, only the twisted split μ_4-normal form will be used here.

Let E be an elliptic curve over a characteristic-two finite field in the twisted split μ_4-normal form:and let P and Q be two points on the curve. A complete system of addition laws is given by the following two maps:respectively, where

For the point , the doubling map sends it to if , and to if . In the twisted split μ_4-normal form, addition of generic points can be evaluated with and doubling of a generic point with field operations, with the notations M for a field multiplication and S for a squaring [19].

Among all the studied coordinate systems on binary curves, the twisted μ_4-normal form and λ-projective coordinates appear to be the fastest. The difference is that the twisted μ_4-normal form is better for double-and-add, while λ-projective coordinates can be combined with the halving operation. The costs of the point operations in the various point representation systems are shown in Table 3.

2.3. Halving Operation

The main ingredient we consider is a cyclic subgroup of E(F_{2^m}) of odd order N, denoted G. The multiply-by-2 isogeny on G is an isomorphism, and so is its inverse map, the halving operation. The use of point halving to speed up scalar multiplication was first investigated by Knudsen [17]. Given a point Q = (u, v) in G, halving computes the point P = (x, y) satisfying 2P = Q at the cost of one field multiplication, one square-root computation, and the solution of one quadratic equation, as can be understood directly from the doubling relations below:

λ = x + y/x,
λ^2 + λ + a = u,
x^2 + u(λ + 1) = v.

The most commonly used method is to solve the second equation for λ, then the third one for x, and finally the first one for y.
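This three-step recipe can be made concrete. The sketch below works over the toy field F_{2^5} = F_2[x]/(x^5 + x^2 + 1) (an odd extension degree, so the half-trace of Section 2.3 applies), doubles a brute-forced curve point P to get Q, halves Q, and checks that doubling the result returns Q. The field, the curve coefficients a = b = 1, and all helper names are illustrative choices of ours.

```python
M, IRR = 5, 0b100101           # F_{2^5} = F_2[x]/(x^5 + x^2 + 1)
A, B = 1, 1                    # toy curve y^2 + xy = x^3 + A x^2 + B

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= IRR
    return r

def gf_inv(a):                 # a^(2^M - 2)
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def gf_sqrt(a):                # sqrt(a) = a^(2^(M-1))
    for _ in range(M - 1):
        a = gf_mul(a, a)
    return a

def half_trace(c):             # solves z^2 + z = c when Tr(c) = 0 (M odd)
    h, x = 0, c
    for _ in range((M - 1) // 2 + 1):
        h ^= x
        x = gf_mul(x, x)
        x = gf_mul(x, x)       # x <- x^4
    return h

def on_curve(x, y):
    return gf_mul(y, y) ^ gf_mul(x, y) == \
        gf_mul(gf_mul(x, x), x) ^ gf_mul(A, gf_mul(x, x)) ^ B

def dbl(P):                    # doubling via lambda = x + y/x, x != 0
    x, y = P
    lam = x ^ gf_mul(y, gf_inv(x))
    u = gf_mul(lam, lam) ^ lam ^ A
    return (u, gf_mul(x, x) ^ gf_mul(lam ^ 1, u))

def halve(Q):
    """Given Q = (u, v) with Q = 2P, recover a point whose double is Q."""
    u, v = Q
    lam = half_trace(A ^ u)                  # solve lam^2 + lam = a + u
    x = gf_sqrt(v ^ gf_mul(u, lam ^ 1))      # x = sqrt(v + u(lam + 1))
    return (x, gf_mul(x, lam ^ x))           # y = x(lam + x)
```

The quadratic has two roots, lam and lam + 1, so `halve` may return either of the two valid halves; doubling either one gives back Q.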

When the λ-representation (x, λ) with λ = x + y/x is used instead of the affine coordinates (x, y), the halving formulas become:

λ_P^2 + λ_P + a = u,
x = sqrt(u(u + λ_P + λ_Q + 1)).

This time only two steps are needed: solve the first equation for λ_P and then the second one for x. Without computing y, the λ-coordinates of the halved point of Q = (u, λ_Q) are obtained more simply.

As proved in [23], solving a quadratic equation λ^2 + λ = c over F_{2^m} with m odd is equivalent to computing the half-trace function H(c) = Σ_{i=0}^{(m−1)/2} c^{2^{2i}}. Although extra memory resources are needed, Fong et al. [23] showed a technique that significantly reduces the required time and space. With a dedicated implementation, a point halving takes approximately twice the time of a field multiplication, significantly faster than the customarily used point doubling.
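The half-trace computation scales directly to cryptographic field sizes. The sketch below evaluates it over F_{2^233}, using the NIST B-233 reduction polynomial f(x) = x^233 + x^74 + 1, and checks the defining identity H(c)^2 + H(c) = c whenever Tr(c) = 0; this is a naive bit-serial implementation for illustration, not the table-based method of Fong et al.

```python
# Half-trace over F_{2^233}: solves z^2 + z = c whenever Tr(c) = 0.
import random

M = 233
IRR = (1 << 233) | (1 << 74) | 1   # NIST B-233 reduction polynomial

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= IRR
    return r

def trace(c):                      # Tr(c) = c + c^2 + ... + c^(2^(M-1)), in {0, 1}
    t, x = 0, c
    for _ in range(M):
        t ^= x
        x = gf_mul(x, x)
    return t

def half_trace(c):                 # H(c) = sum_{i=0}^{(M-1)/2} c^(4^i)
    h, x = 0, c
    for _ in range((M - 1) // 2 + 1):
        h ^= x
        x = gf_mul(x, x)
        x = gf_mul(x, x)           # x <- x^4
    return h
```

Because H(c)^2 + H(c) = c + Tr(c) for odd M, flipping the low bit of c (note Tr(1) = M mod 2 = 1) forces the trace to zero when needed.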

From the algorithmic view, the halve-and-add method [17] expands the scalar in a radix-1/2 representation system. Let n be the binary length of N; first compute k' = 2^{n−1}k mod N, so that k ≡ Σ k'_i 2^{i−(n−1)} (mod N). Much like double-and-add, the point multiplication kP can then be computed efficiently by applying point halving to an accumulator. It can be further optimized with methods like w-NAF to get better implementation performance, as shown in [23].

Enlightened by the treatment in halve-and-add, if we choose an appropriate number s less than n, the scalar can be split into two parts naturally. In consequence, the halve-and-add method is easy to implement concurrently with the double-and-add algorithm in a parallel model, making use of the increasing number of cores in modern processors, which is much faster than applying one algorithm without parallelism (some unavoidable preprocessing cost must be considered in advance). Specifically, if the length of k is n and a proper s has been chosen, the scalar can be split into two portions processed by the halve-and-add and double-and-add algorithms simultaneously, as indicated below; the length of each part (s and n − s) depends on the actual implementation speeds of halving and doubling, which can be found experimentally:

If we already have the binary expansion of 2^s k mod N, where N is the odd order of P, then it is easily derived that k ≡ (2^s k mod N) · 2^{−s} (mod N). The scalar multiplication kP is then split into two parts directly:

The first part is easily executed with the halve-and-add method; meanwhile, the second part can be performed with a double-and-add approach, in two different threads.
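The split can be sketched in a few lines. The group is again modeled as the integers modulo an odd N, and the names `split_scalar` and `recombine` are illustrative, not from the paper.

```python
def split_scalar(k, N, s):
    """Split k (mod odd N) so that k = k_low / 2^s + k_high (mod N)."""
    c = (k << s) % N                 # c = 2^s * k mod N
    k_low = c & ((1 << s) - 1)       # s least significant bits -> halve-and-add
    k_high = c >> s                  # remaining bits           -> double-and-add
    return k_low, k_high

def recombine(k_low, k_high, P, N, s):
    # One thread would compute (k_low / 2^s) P with halve-and-add while the
    # other computes k_high * P with double-and-add; one final addition joins them.
    half_part = k_low * pow(2, -s, N) % N * P % N
    dbl_part = k_high * P % N
    return (half_part + dbl_part) % N
```

The identity holds because c = k_low + 2^s k_high, so multiplying by 2^{−s} modulo N recovers k.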

As far as side-channel attacks are concerned, noticing that double-and-add can be implemented with Montgomery-ladder point multiplication, Negre and Robert [14] presented an analogous Montgomery-halving algorithm. During each iteration, the two registers hold a fixed difference, and the algorithm processes one point halving and one addition. However, as the authors note, this algorithm can only be implemented in affine coordinates, since the halving operation cannot be implemented efficiently in projective coordinates there. To overcome this drawback, we present another regular recoding algorithm that can be used when implementing the parallel halve-and-add/double-and-add in a projective coordinate system.

3. Regular Implementation

Protecting an implementation of scalar multiplication against SSCA can be achieved in many ways. Compared with an unprotected implementation, algorithmic countermeasures such as recoding scalars in a regular manner always sacrifice some efficiency, yet this overhead can be largely recovered by taking advantage of the inherent parallelism of modern processors.

3.1. Zero-Less Signed-Digit Expansion

In general, point addition and doubling on elliptic curves are quite different from ordinary arithmetic operations; they are complicated and time consuming, and many researchers have sought efficient ways to speed them up, including the work in this paper. As is well known, negating a point is very cheap, which makes subtraction of points on elliptic curves just as efficient as addition. This motivates modifying the binary method to use signed-digit representations, that is, representing the scalar with digits in the set {−1, 0, 1} instead of {0, 1}. Among the many kinds of signed-digit representations, we choose the zero-less signed-digit expansion to build regular algorithms that improve the resistance of scalar multiplication against timing attacks and SPA.

Zero-less signed-digit (ZSD) expansion [24] is a highly regular scalar recoding algorithm that expresses an odd integer k with digits in {−1, 1}, where −1 is usually denoted by a barred 1. Since the digit 0 never appears in the recoded sequence, each iteration of the point multiplication performs one double-and-add operation, providing natural protection against timing attacks and SPA.

Let (k_{n−1}, …, k_1, k_0)_2 be the binary expansion of a scalar k. Note that any pair of consecutive bits (0, 1) in this expansion can be rewritten as the signed pair (1, −1), since 2^{i+1} − 2^i = 2^i. A similar treatment applies to the radix-1/2 expansion of k, since 2^{−i+1} − 2^{−i} = 2^{−i}. When applying the halve-and-add algorithm, consecutive bits can therefore be rewritten in the same way. So if k is an odd integer, its radix-2 ZSD expansion (or the corresponding radix-1/2 one) with digits in {−1, 1} can be obtained from

From a security standpoint, every digit should be nonzero. When k is even, a special treatment is required. This can be handled by computing with the least significant bit of k forced to 1 and finally subtracting P (or the corresponding halved point) from the result if that bit was actually zero. The three algorithms in this paper use this treatment to handle even and odd inputs correctly.
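The rewrite above can be checked mechanically. The function below (an illustrative name of ours) derives the ZSD digits of an odd integer directly from its binary expansion and verifies that the value is preserved with no zero digit.

```python
def zsd(k):
    """ZSD digits (least significant first, each in {-1, 1}) of an odd k.

    Applies the (0,1) -> (1,-1) rewrite: b_i = 2*bit_{i+1}(k) - 1, top digit 1.
    """
    assert k & 1
    n = k.bit_length()
    return [2 * ((k >> (i + 1)) & 1) - 1 for i in range(n - 1)] + [1]

# Value is preserved and every digit is nonzero:
for k in range(1, 200, 2):
    b = zsd(k)
    assert sum(d * (1 << i) for i, d in enumerate(b)) == k
    assert all(d in (-1, 1) for d in b)
```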

Having introduced the ZSD expansion, we now combine it with the common binary methods to obtain regular scalar multiplication algorithms. Algorithm 2 illustrates the regular ZSD double-and-add method based on the radix-2 expansion from left to right, while Algorithm 3 works from the opposite side.

Input: of odd order , with
Output:
(1);
(2)For down to 1 do
(3)  
(4)  
(5)End for
(6)
(7)Return
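Because the symbols in the listing above were lost, here is a hedged model of the left-to-right ZSD double-and-add over the integers modulo an odd N, including the even-scalar correction described in Section 3.1. All names are ours.

```python
def zsd_digits(k, n):
    """n-digit ZSD expansion (digits in {-1, 1}) of an odd k < 2^n."""
    assert k & 1 and k < (1 << n)
    return [2 * ((k >> (i + 1)) & 1) - 1 for i in range(n - 1)] + [1]

def zsd_double_and_add(k, P, N):
    """k*P in Z_N: exactly one doubling and one addition per bit, no dummy ops."""
    n = max(k.bit_length(), 1)
    forced = (k & 1) == 0
    k |= 1                              # force odd; subtract P at the end if needed
    b = zsd_digits(k, n)
    R = b[n - 1] * P % N                # top digit is always 1, so R = P
    for i in range(n - 2, -1, -1):      # scan digits left to right
        R = (2 * R + b[i] * P) % N      # R <- 2R + b_i * P
    if forced:
        R = (R - P) % N                 # undo the forced low bit
    return R
```

On a curve, R would be kept in projective (or twisted μ_4) coordinates so that no inversion occurs inside the loop.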
Input: of odd order , with
Output:
(1)
(2)If then
(3)  
(4)Else
(5)  
(6)For to do
(7)  
(8)  ;
(9)End for
(10)
(11)Return

Algorithms 2 and 3 give regular binary methods for evaluating elliptic scalar multiplication based on the radix-2 expansion. When it comes to calculating the halving part, a similar recoding based on radix 1/2 has to be considered, for which the halve-and-add method is needed. Starting from Algorithms 2 and 3 with a slight modification, we obtain Algorithm 1 for the regular halve-and-add.

3.2. Parallelized Regular Scalar Multiplication

Let P be a point of odd order N with bit length n, and let k be a scalar. The parallelized double-and-add/halve-and-add algorithm for scalar multiplication consists of three stages: preprocessing, implementing, and postprocessing. Figure 1 gives an overview of the whole process.
Preprocessing: select a proper split point and compute . So , where is the most significant bits and is the least significant bits of . This equation indicates .
Implementing: the point multiplication is done by concurrently running the binary method and the radix-1/2 method in two different threads. In detail:
(1) Feed the parameters and as inputs to the regular double-and-add algorithm (Algorithm 2 or Algorithm 3) in one thread. The resulting point is stored in a register.
(2) Meanwhile, feed the parameters and as inputs to the regular halve-and-add algorithm (Algorithm 1) in another thread. The resulting point is stored in another register.
Postprocessing: a single point addition yields the correct result of the scalar multiplication.
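Under the same integers-mod-N model used earlier, the three stages can be sketched with two worker threads. In CPython the threads mainly illustrate the structure rather than a true speedup; on dedicated coprocessors the two scans would genuinely run concurrently. All names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def double_and_add(k, P, N):            # stand-in for Algorithm 2/3
    R = 0
    for bit in bin(k)[2:]:
        R = (2 * R + (P if bit == '1' else 0)) % N
    return R

def halve_and_add(k_low, s, P, N):      # stand-in for Algorithm 1
    inv2 = pow(2, -1, N)
    Q = 0
    for i in range(s):                  # Q accumulates sum of k_low_i * 2^{i-s} * P
        Q = (Q * inv2 + ((k_low >> i) & 1) * P) % N
    return Q * inv2 % N

def parallel_mul(k, P, N, s):
    c = (k << s) % N                                    # preprocessing: c = 2^s k mod N
    k_low, k_high = c & ((1 << s) - 1), c >> s
    with ThreadPoolExecutor(max_workers=2) as ex:       # implementing: two threads
        f1 = ex.submit(halve_and_add, k_low, s, P, N)
        f2 = ex.submit(double_and_add, k_high, P, N)
        return (f1.result() + f2.result()) % N          # postprocessing: one addition
```

The split point s would be tuned so that the halving thread and the doubling thread finish at roughly the same time.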

4. Comparison and Expected Performance

Numerous standards include NIST-recommended curves as the underlying abelian groups for cryptographic protocols. The general conclusions in Tables 1 and 2 are specifically for NIST-recommended random curves of the form y^2 + xy = x^3 + x^2 + b, where b is an element of F_{2^m}. To allow easy comparison, the two curves considered in this section are NIST B-233 and NIST B-409, defined by f(x) = x^233 + x^74 + 1 and f(x) = x^409 + x^87 + 1 over F_{2^233} and F_{2^409}, respectively.

4.1. Analysis

The theoretical complexity of the four considered scalar multiplication approaches is reported in Table 1. Our goal is to improve the algorithms in [14] and give a better parallel algorithm for evaluating scalar multiplication. (Algorithms 2 and 3 have similar complexity, so only Algorithm 2 is discussed in what follows.)

For regular implementations resistant to SSCA, both the Montgomery methods and our new methods need m point doublings and m point additions for the double-and-add algorithms, and m point halvings and m point additions for the halve-and-add algorithms. Specifically, in Montgomery-D, and denote the doubling and addition operations of the very efficient Montgomery double-and-add algorithm of [25]. It is so effective that only field operations are needed for Montgomery-D, where M and I represent a field multiplication and an inversion. In Montgomery-H, and are the halving and addition operations in affine coordinates. Halving typically involves a field multiplication, a trace computation, solving a quadratic equation, and a square-root computation. According to the analysis and experimental results in [14, 23], we can assume that halving needs field operations in affine coordinates and field operations in projective coordinates. Besides, addition in affine coordinates needs field operations. Unavoidably, the structure of the Montgomery-H algorithm forces it to use affine coordinates only, because no suitable projective coordinate system can be applied so far, which significantly hurts its efficiency, as the estimates below show.

In Algorithm 2, and represent doubling and addition in λ-projective coordinates, while and represent doubling and addition in twisted μ_4-projective coordinates. In terms of field operations, and require and , while and require and . In Algorithm 1, denotes halving in affine coordinates and requires . In particular, if the mixed addition operation and the 2Q + P formula of Section 2.1 are exploited, the field-operation count of Algorithm 2 becomes for λ-projective coordinates, where is the cost of the final mixed addition in Algorithm 2 and is the cost of transforming the final result from λ-projective to affine coordinates. For twisted μ_4-projective coordinates, field operations are needed, where is the cost of the final mixed addition and is the cost of transforming the final result from twisted μ_4-projective to affine coordinates. As for Algorithm 1, the mixed projective coordinate system can be applied to save inversion operations, owing to Algorithm 1's structure being different from that of Montgomery-H; similarly, field operations are consumed there.

In this work, we assume a fixed inversion-to-multiplication ratio and ignore the cost of squarings, following [14, 23]. In fact, squaring is nearly the fastest of all the field operations discussed in this paper and usually costs less than a multiplication, so it can be ignored. The assumed ratio is a commonly used reference value; yet on many platforms an inversion may be even more expensive, in which case Montgomery-H is affected the most while the other three methods are almost unaffected. This is another benefit of using a projective coordinate system. Given the above cost comparison, the two examples NIST B-233 and B-409 are shown in Table 1 for easier understanding.

For double-and-add, the Montgomery-D algorithm is so outstanding that Algorithm 2 still cannot catch up with it, even using the twisted μ_4-projective coordinate system, the fastest to date. For halve-and-add, Algorithm 1 saves and in cost compared with Montgomery-H for m = 233 and m = 409. This means our regular halve-and-add algorithm is much more useful in practice thanks to projective coordinates, and a parallel method built on the faster algorithm is accordingly more efficient.

One may ask why the mixed projective coordinate system cannot be applied to Montgomery-H, since comparing the two algorithms in different coordinate systems may seem unfair. This is not a deliberate trick. Looking closely at Montgomery-H in [14], suppose one register is held in affine coordinates and the other in projective coordinates when the current bit is 0; we then face the dilemma of transforming back to affine coordinates for the halving operation and to projective coordinates for the mixed addition when the bit is 1. Every time two consecutive bits differ, a transformation has to be done. For a random m-bit binary number whose leftmost bit is 1, the average number of 0-to-1 or 1-to-0 transitions is approximately (m − 1)/2. Transforming from projective to affine coordinates costs field operations. Taking these costs into account, Montgomery-H would need around field operations, which is more than simply staying in affine coordinates. So the best solution is a new structure like that of Algorithm 1.
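The expected transition count can be verified by exhaustive enumeration at a small bit length: for uniformly random lower bits, each adjacent pair differs with probability 1/2, so an m-bit string with leading 1 averages exactly (m − 1)/2 transitions.

```python
m = 12
strings = range(1 << (m - 1), 1 << m)          # all m-bit strings with leading 1

def transitions(s):
    # count adjacent bit pairs that differ, i.e. the '01'/'10' boundaries
    return bin((s ^ (s >> 1)) & ((1 << (m - 1)) - 1)).count('1')

avg = sum(transitions(s) for s in strings) / (1 << (m - 1))
assert avg == (m - 1) / 2                       # exactly 5.5 for m = 12
```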

4.2. Parallelism and a New Discovery

Negre and Robert [14] drew inspiration from [26] and used a split technique similar to the one introduced there. They also provided a Montgomery-halving algorithm analogous to the original Montgomery-ladder scalar multiplication method. Combining these ideas, they presented a parallel method using the Montgomery-D and Montgomery-H algorithms. Unfortunately, the Montgomery-H method of [14] can only use affine coordinates because of its special structure. To solve this, we propose a new regular parallel approach combining Montgomery-D and Algorithm 1, which we call mixed-parallel.

After analyzing each algorithm in Section 4.1, we can take a suitable split and examine the complexity in the parallel setting. The specific results are shown in Table 2. In the Algorithm column, Montgomery-parallel is the parallel algorithm of [14], which executes Montgomery-D and Montgomery-H concurrently in two different threads. Our mixed-parallel in the last row is the new combined algorithm, which runs Montgomery-D and Algorithm 1 simultaneously on different coprocessors.

It is evident that Montgomery-D has the least cost among all the algorithms in Table 1. However, either parallel method in Table 2 costs less than Montgomery-D. Comparing Montgomery-parallel with Montgomery-D first, the Montgomery-parallel algorithm saves and in cost relative to Montgomery-D for m = 233 and m = 409, so parallelism is indeed a good idea for computing scalar multiplication. Furthermore, combining the best double-and-add method, Montgomery-D, with the best halve-and-add method, Algorithm 1, yields the new efficient mixed-parallel method. The estimates demonstrate that our mixed-parallel method costs and less than Montgomery-parallel for m = 233 and m = 409, respectively. This is a new discovery and a new record.

5. Conclusion

In this paper, we present a new parallel algorithm that improves on the Montgomery algorithm of [14]. Both methods take advantage of the inherent parallelism of modern processors to construct parallel approaches. Instead of a Montgomery-like ladder, our approach applies a regular recoding technique and is highly efficient, processing double-and-add and halve-and-add in parallel. Like the Montgomery approach, the regular method protects the computation against SSCA.

After careful analysis of these algorithms, we conclude that our regular halve-and-add approach, Algorithm 1, can use projective coordinates, making up for the disadvantage of Montgomery-H and saving about and in cost compared with Montgomery-H for m = 233 and m = 409.

As a result, combining Montgomery-D and Algorithm 1 yields a new, preferable parallel approach, our mixed-parallel method. It costs and less than Montgomery-parallel for m = 233 and m = 409, respectively. This is a new record as well as a good improvement on, and supplement to, the previous excellent work of [14].

Data Availability

All data generated or analyzed during this study are included in this published article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 61872442, 61772515, 61502487, and U1936209); the National Cryptography Development Fund (No. MMJJ20180216); and the Beijing Municipal Science & Technology Commission (Project no. Z191100007119006).