Abstract

This paper proposes a hardware-efficient elliptic curve cryptography (ECC) architecture over GF(p) that performs scalar multiplication (SM) using only adders, through hardware reuse. At the algorithm level, the interleaved modular multiplication (IMM) algorithm and the binary modular inverse (BMI) algorithm are modified so that both need only two adders. Besides the adders, the data registers are the other optimization target. The design is synthesized with Design Compiler on a 0.13 µm CMOS ASIC platform. At a frequency of 150 MHz, scalar multiplication over the 160, 192, 224, and 256 bit fields takes 1.99–3.17 ms. Moreover, the gate area required for these field sizes is in the range of 35.65k–59.14k gates, which is 50%–91% less hardware than other processors.

1. Introduction

Owing to the rapid development of technology, Internet of Things (IoT) devices have become popular, and their security must be guaranteed. Beyond IoT devices, the safety of road networks also deserves close attention [1]. Elliptic curve cryptography (ECC), an asymmetric cryptosystem proposed independently by Miller [2] and Koblitz [3] in the mid-1980s, offers higher security for a given key length than alternatives such as the RSA algorithm. Several international organizations have adopted ECC, including NIST [4], ANSI [5], and IEEE [6].

Many hardware architectures have been proposed for ECC [7–17]. They realize modular multiplication (MM) in one of two ways: with a multiplier or with an adder. Multiplier-based architectures include designs for a specific prime field and designs based on the Montgomery multiplication algorithm [7]. Adder-based architectures include designs based on the interleaved multiplication algorithm [9]. The processor in [13] uses the Montgomery MM algorithm with an r-bit × r-bit multiplier. The processors in [8, 14] use an n-bit × n-bit multiplier, where MM consists of a multiplication followed by a fast reduction over a specific prime field. Multiplier-based architectures, however, require a large amount of hardware.

In ECC, modular inversion (MI) is another costly operation, and hardware-efficient architectures usually use binary modular inversion algorithms. The MM and MI units of the processor in [11] are adder-based, and the two units have separate adders. The processors in [18, 19] adopt a radix-4 Booth-encoded IMM algorithm. The processor in [20] implements MM with a radix-2 MM algorithm and avoids MI by using projective coordinates.

Software implementations of traditional cryptographic algorithms suffer from high power consumption and long latency, which a hardware implementation can address. This article aims to provide low-power security for IoT devices through hardware implementation. Its main contributions are as follows:
(1) A hardware-efficient architecture based on adder units is proposed to keep hardware consumption as low as possible.
(2) The IMM and BMI algorithms are modified so that MM and MI can be realized with two full-word adders and four data registers.
(3) The registers are optimized to minimize hardware consumption: four full-word register units serve MM, modular subtraction (MS), modular addition (MA), and MI, and eight full-word register units serve the SM operation.

The structure of this article is divided into four parts. First, the Mathematical Background section elaborates on EC operation and SM operation. Second, the Scalar Multiplication Architecture section introduces the hardware-efficient architecture over GF(p). Third, the Implementation and Result section shows the results and then conducts comparative analysis. Fourth, the Conclusion section is a summary.

2. Mathematical Background

2.1. Elliptic Curve over GF(p)

This section introduces elliptic curves (EC) over GF(p). For a prime p > 3, a nonsupersingular elliptic curve E over GF(p) is defined by

$$y^2 = x^3 + ax + b, \tag{1}$$

where a, b, x, and y are elements of GF(p), and $4a^3 + 27b^2 \neq 0 \pmod{p}$. See [21, 22] for more information on elliptic curve cryptographic primitives.
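As a small illustration of the curve definition, the sketch below checks the nonsingularity condition and tests whether a point satisfies the short Weierstrass equation y^2 = x^3 + ax + b over GF(p). The function names and the toy parameters (p = 17, a = b = 2) are assumptions chosen for readability; they are not taken from the paper and are far too small for cryptographic use.

```python
def is_nonsingular(a: int, b: int, p: int) -> bool:
    """Return True if 4a^3 + 27b^2 is nonzero modulo p."""
    return (4 * pow(a, 3, p) + 27 * pow(b, 2, p)) % p != 0


def on_curve(x: int, y: int, a: int, b: int, p: int) -> bool:
    """Return True if (x, y) satisfies y^2 = x^3 + a*x + b over GF(p)."""
    return (y * y - (x * x * x + a * x + b)) % p == 0


# Hypothetical toy curve and point, for illustration only.
P_TOY, A_TOY, B_TOY = 17, 2, 2
G_TOY = (5, 1)

curve_ok = is_nonsingular(A_TOY, B_TOY, P_TOY)
point_ok = on_curve(G_TOY[0], G_TOY[1], A_TOY, B_TOY, P_TOY)
```

Both checks only need modular multiplications and additions, which is why they map directly onto the modular units discussed later.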

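Formulas (2) and (3) show the point addition (PA) and point doubling (PD) operations. For elliptic curve points $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ with result $P_3 = (x_3, y_3)$, the computing formula for PA, $P_3 = P_1 + P_2$ with $P_1 \neq \pm P_2$, is

$$\lambda = \frac{y_1 - y_2}{x_1 - x_2}, \quad x_3 = \lambda^2 - x_1 - x_2, \quad y_3 = \lambda (x_2 - x_3) - y_2, \tag{2}$$

and the computing formula for PD, $P_3 = 2P_1$, is

$$\lambda = \frac{3x_1^2 + a}{2y_1}, \quad x_3 = \lambda^2 - 2x_1, \quad y_3 = \lambda (x_1 - x_3) - y_1. \tag{3}$$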

2.2. Elliptic Curve Scalar Multiplication

In ECC, SM is the basic operation. SM takes an integer k and a point P on the elliptic curve as inputs and computes kP as a sequence of PA and PD operations, as given in Algorithm 1. In Step 1, the point Q is initialized to the point at infinity. Step 2 iterates once for each bit of k, and each iteration performs a PD operation; a bit k_i = 1 indicates that a PA operation is also performed.

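Algorithm 1: Scalar multiplication.
Input: an integer k = (k_{n−1}, …, k_1, k_0)_2 and a point P on the elliptic curve
Output: Q = kP
(1) Q = O (the point at infinity);
(2) for (i = n − 1; i ≥ 0; i = i − 1) {
(2.1) Q = 2Q;
(2.2) if k_i = 1, {Q = Q + P;}
}
(3) return Q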
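The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, assuming the left-to-right (most significant bit first) binary method; the function name `double_and_add` and its callback parameters are hypothetical. To keep the block self-contained, the integers modulo 101 under addition stand in for the group of curve points, so "doubling" is multiplication by 2 and the identity is 0.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def double_and_add(k: int, point: T, identity: T,
                   add: Callable[[T, T], T],
                   double: Callable[[T], T]) -> T:
    """Compute k * point with one doubling per bit of k and one addition per 1 bit."""
    q = identity                                   # Step 1: Q = point at infinity
    for i in range(k.bit_length() - 1, -1, -1):    # Step 2: scan bits, MSB first
        q = double(q)                              # Step 2.1: PD in every iteration
        if (k >> i) & 1:                           # Step 2.2: PA only when k_i = 1
            q = add(q, point)
    return q                                       # Step 3


# Stand-in group (assumption): integers modulo 101 under addition.
M = 101
result = double_and_add(13, 7, 0,
                        lambda x, y: (x + y) % M,
                        lambda x: (2 * x) % M)     # 13 * 7 mod 101 = 91
```

Replacing the two callbacks with PA and PD routines turns the same loop into an elliptic curve scalar multiplication; the number of PD operations equals the bit length of k, and the number of PA operations equals the number of 1 bits.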

3. Scalar Multiplication Architecture

This section describes the bottom-up algorithm optimization over GF(p), which maximizes the reuse of the adder units. The SM operation is implemented with two full-word adder units. Optimizing the MM and MI operations helps to reduce power consumption and to improve the performance of the SM operation.

3.1. Modular Addition/Subtraction

The MA and MS operations are implemented according to Algorithm 2. In an ASIC, addition and subtraction can be implemented with nearly identical hardware, namely, adder units. Because each MA or MS operation completes in one clock cycle and its steps 1 and 2 each need an adder, two full-word adders are required. Here, the adder unit is the minimum unit, and C1n and C0n are the most significant bits (MSBs) of C1 and C0, which act as sign bits.

Algorithm 2: Modular addition (MA) and modular subtraction (MS).
Modular addition:
Input: p, A, B ∈ [0, p − 1]
Output: R = (A + B) mod p
(1) C0 = A + B
(2) C1 = C0 − p
(3) if C1n = 1 {R = C0}
(4) else {R = C1}
(5) return R
Modular subtraction:
Input: p, A, B ∈ [0, p − 1]
Output: R = (A − B) mod p
(1) C0 = A − B
(2) C1 = C0 + p
(3) if C0n = 1 {R = C1}
(4) else {R = C0}
(5) return R

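
The two-adder selection in Algorithm 2 can be modeled bit-accurately in a few lines. The sketch below is an illustration under stated assumptions: operands are n bits wide, intermediate values are held as (n + 1)-bit two's complement words, and bit n is the sign bit that Algorithm 2 calls the MSB. The function names and the `n` parameter are hypothetical.

```python
def mod_add(a: int, b: int, p: int, n: int) -> int:
    """(a + b) mod p for a, b in [0, p-1], p < 2**n, using two additions."""
    mask = (1 << (n + 1)) - 1
    c0 = (a + b) & mask            # first adder:  C0 = A + B
    c1 = (c0 - p) & mask           # second adder: C1 = C0 - p (two's complement)
    c1_msb = (c1 >> n) & 1         # sign bit of C1
    return c0 if c1_msb else c1    # negative C1 means A + B < p


def mod_sub(a: int, b: int, p: int, n: int) -> int:
    """(a - b) mod p for a, b in [0, p-1], p < 2**n, using two additions."""
    mask = (1 << (n + 1)) - 1
    c0 = (a - b) & mask            # first adder:  C0 = A - B (two's complement)
    c1 = (c0 + p) & mask           # second adder: C1 = C0 + p
    c0_msb = (c0 >> n) & 1         # sign bit of C0
    return c1 if c0_msb else c0    # negative C0 means A < B
```

Both candidates are computed unconditionally and a single sign bit selects the result, which is what allows the operation to finish in one clock cycle with two adders.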
3.2. Modular Multiplication

MM is an indispensable operation in the SM architecture. This study adopts the interleaved modular multiplication algorithm. The standard interleaved modular multiplication in [16] (Algorithm 2) has shortcomings: its steps 5, 6, and 7 perform additions with carry propagation, and its steps 6 and 7 compare the operands over their full length, so the latency is large. To address this problem, the improved algorithm in [16] (Algorithm 3) performs the additions in the loop with carry-save adders, and the modified algorithm in [16] (Algorithm 4) further reduces area and time with a lookup-table method. In [10], a new interleaved modular multiplication algorithm is proposed that uses only two adder units. Its steps are shown in Algorithm 5. In step 1, the variable R is initialized to zero. In step 2.1, R ∗ 2 is realized by a shift. In step 2.2, Xi ∗ Y is implemented by a multiplexer. Steps 2.3 and 2.4 each require one adder unit. Therefore, if each iteration completes within one clock cycle, a total of two adder units are required. After the iterations of step 2, the result lies in [0, 2p − 1], so step 3 is needed to reduce R to [0, p − 1].
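For orientation, the following Python sketch shows the textbook form of interleaved modular multiplication: shift, conditionally add Y, then reduce. It is not the two-adder schedule of Algorithm 5; this variant fully reduces R in every iteration and therefore uses up to three additions or subtractions per bit, which is exactly the cost that the modified algorithm avoids. The function name is hypothetical.

```python
def interleaved_modmul(x: int, y: int, p: int) -> int:
    """Return x * y mod p for x, y in [0, p-1], scanning x from its MSB."""
    n = p.bit_length()
    r = 0                              # R initialized to zero
    for i in range(n - 1, -1, -1):
        r <<= 1                        # R * 2 realized by a shift
        if (x >> i) & 1:               # X_i * Y realized by a selection (multiplexer)
            r += y
        # Here r < 3p, so at most two subtractions bring it back below p.
        if r >= p:
            r -= p
        if r >= p:
            r -= p
    return r
```

Deferring the final reduction, so that R is only kept within [0, 2p − 1] inside the loop and corrected once at the end, is the idea that the paragraph above attributes to the two-adder version.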

Algorithm 3: Binary modular inversion.
Input:
Output: , satisfying
Step 1: ;
Step 2: if ( is even) {
;
 if (r is odd) ;
 else if (r is negative) ;
 else ;
 }
Step 3: if ( is even) {
;
 if (s is odd) ;
 else if (s is negative) ;
 else;
 }
Step 4: if (u and are odd) {
 if() ;
 else ;
 }
Step 5: if () {
  if () {return ; }
  else { return .}
 }
 else if () {
  if () {return ; }
  else {return .}
 }
 else {go to step2.}

Algorithm 4: Point addition (PA) and point doubling (PD).
Point addition:
Input: P1(x1, y1), P2(x2, y2)
Output: P3(x3, y3) = P1 + P2
(1) t2 = y1 − y2
(2) t1 = x1 − x2
(3) t1 = t2/t1
(4) t2 = t1 × t1
(5) t2 = t2 − x1
(6) t2 = t2 − x2
(7) x1 = t2, t2 = x2 − t2
(8) t2 = t1 × t2
(9) y1 = t2 − y2
(10) return x3 = x1, y3 = y1
Point doubling:
Input: P1(x1, y1) = P2(x2, y2)
Output: P3(x3, y3) = 2P1
(1) t2 = x1 × x1
(2) t1 = t2 + t2
(3) t1 = t2 + t1
(4) t2 = t1 + a
(5) t1 = y1 + y1
(6) t1 = t2/t1
(7) t2 = t1 × t1
(8) t2 = t2 − x1
(9) t2 = t2 − x2
(10) x1 = t2, t2 = x2 − t2
(11) t2 = t1 × t2
(12) y1 = t2 − y2
(13) return x3 = x1, y3 = y1

Algorithm 5: Interleaved modular multiplication.
Input:
Output:
(1);
(2)for(2.1);(2.2);(2.3);(2.4);}
(3)if {; }
(4)return

3.3. Modular Inversion

In addition to MM, the modular inversion (MI) operation plays an extremely important role in the SM architecture. The MI operation reuses the same two adder units to reduce hardware consumption. This paper adopts the binary modular inversion algorithm proposed in [10]. Algorithm 3 can perform the MM and MI operations together in the same clock cycles; if the input a = 1, it is a pure MI operation, and the output is the modular inverse of the other operand mod p. In step 1, the variables u, v, r, and s are initialized. In steps 2 and 3, the /2 operations are realized by a one-bit right shift. For an odd r, whether positive or negative, r = (r + p)/2, that is, r is added to p and the sum is shifted right; the other cases are handled similarly. These operations require one adder unit. During step 2 or step 3, the comparison between u and v needed in step 4 is calculated in advance, so two adder units are required in total. In step 4, u − v or v − u is calculated, which requires two adder units. Step 5 requires two adder units: one determines whether r or s is less than 0, and the other computes r + p or s + p. Therefore, if each step completes in one clock cycle, two adder units are sufficient.
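The behaviour described above, halving with a conditional addition of p and subtracting the smaller of two odd values from the larger, matches the well-known binary algorithm for modular division. The sketch below is a textbook-style version with signed r and s, not a reconstruction of Algorithm 3 from [10]: the function name, the argument names, and the initialization u = x, v = p, r = a, s = 0 are assumptions, and the final reduction uses Python's `%` instead of a single conditional addition because no attempt is made here to bound r and s.

```python
def binary_mod_div(a: int, x: int, p: int) -> int:
    """Return a * x^(-1) mod p for an odd prime p and 0 < x < p.

    With a = 1 the result is the modular inverse of x.
    Invariants: x * r == u * a (mod p) and x * s == v * a (mod p).
    """
    u, v, r, s = x, p, a, 0
    while u != 1 and v != 1:
        while u % 2 == 0:
            u >>= 1                                   # u / 2 by right shift
            r = (r + p) >> 1 if r & 1 else r >> 1     # odd r: add p, then shift
        while v % 2 == 0:
            v >>= 1
            s = (s + p) >> 1 if s & 1 else s >> 1
        if u >= v:                                    # both odd: subtract the smaller
            u, r = u - v, r - s
        else:
            v, s = v - u, s - r
    result = r if u == 1 else s
    return result % p                                 # map a negative result into [0, p-1]
```

Every step uses only shifts, comparisons, additions, and subtractions, which is why this family of algorithms suits an adder-only datapath.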

3.4. Point Addition and Point Doubling

Algorithm 4 gives the PA and PD operations. Because the modular operations (MA, MS, MM, and MI) share the same two adder units, only one modular operation is computed at a time. A total of eight registers are required: six (x1, y1, x2, y2, t1, and t2) for the PA and PD operations and two for the integer k and the prime p.
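The register schedule of Algorithm 4 can be checked with a short Python model that uses only the six working registers x1, y1, x2, y2, t1, and t2 and one modular operation per line. This is a sketch under assumptions: PA step (9) is read as a subtraction and PD step (11) as a multiplication, as the affine formulas require; the division is modeled with Python's built-in `pow(d, -1, p)`; the function names and the toy curve y^2 = x^3 + 2x + 2 over GF(17) are hypothetical; and special cases such as the point at infinity are not handled.

```python
def point_add(x1: int, y1: int, x2: int, y2: int, p: int) -> tuple[int, int]:
    """P1 + P2 for distinct points with x1 != x2, one modular operation per line."""
    t2 = (y1 - y2) % p
    t1 = (x1 - x2) % p
    t1 = t2 * pow(t1, -1, p) % p        # t1 = slope
    t2 = t1 * t1 % p
    t2 = (t2 - x1) % p
    t2 = (t2 - x2) % p                  # t2 = x3
    x1, t2 = t2, (x2 - t2) % p          # store x3, then form x2 - x3
    t2 = t1 * t2 % p
    y1 = (t2 - y2) % p                  # y3
    return x1, y1


def point_double(x1: int, y1: int, a: int, p: int) -> tuple[int, int]:
    """2 * P1 for y1 != 0, following the same register usage."""
    x2, y2 = x1, y1                     # P2 = P1
    t2 = x1 * x1 % p
    t1 = (t2 + t2) % p
    t1 = (t2 + t1) % p                  # 3 * x1^2
    t2 = (t1 + a) % p
    t1 = (y1 + y1) % p
    t1 = t2 * pow(t1, -1, p) % p        # t1 = slope
    t2 = t1 * t1 % p
    t2 = (t2 - x1) % p
    t2 = (t2 - x2) % p                  # t2 = x3
    x1, t2 = t2, (x2 - t2) % p
    t2 = t1 * t2 % p
    y1 = (t2 - y2) % p                  # y3
    return x1, y1


# Toy example (assumption): G = (5, 1) on y^2 = x^3 + 2x + 2 over GF(17).
two_g = point_double(5, 1, 2, 17)             # (6, 3)
three_g = point_add(5, 1, two_g[0], two_g[1], 17)   # (10, 6)
```

No temporary beyond t1 and t2 is needed in either routine, which is consistent with the count of six registers for the point operations.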

3.5. Scalar Multiplier Architecture

Figure 1 shows the proposed SM architecture over GF(p), which implements the modular operations MM, MS, MA, and MI as well as the point operations SM, PA, and PD. The point controller block is the main state machine that sequences the point operations, and the modular controller block is the state machine that sequences the modular operations.

4. Implementation and Result

The ECC architecture described here is designed in Verilog HDL and synthesized with Design Compiler using the SMIC 130 nm CMOS standard cell library. The circuit area is reported in equivalent two-input NAND gates.

The simulation parameters are taken from the FIPS 186-2 standard [8]. Figure 2 lists the main parameters of the 256 bit elliptic curve over the prime field GF(p); the curves for the other field sizes can be found in the FIPS 186-2 standard. Gx and Gy are the coordinates of the base point G on the elliptic curve.

The proposed architecture needs a total of two adders and twelve data registers. According to Table 1, these registers and adders account for 42% of the hardware. The twelve registers are used for data storage. As the field size increases, the adders' share of the resources rises from 13.72% to 15.54%.

Table 2 shows the implementation results of the proposed architecture and the comparison with other designs. Averaged over 100 tests, one SM operation requires about 186 k, 268 k, 364 k, and 475 k clock cycles over the 160, 192, 224, and 256 bit prime fields, respectively. The proposed architecture takes 1.24, 1.78, 2.42, and 3.16 ms, with gate areas of 35.65 k, 43.25 k, 49.41 k, and 59.14 k, for one SM operation over the 160, 192, 224, and 256 bit prime fields, respectively.
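The relation between the cycle counts and the execution times can be verified with a few lines of Python. The sketch assumes that the cycle counts are in thousands of cycles and that the clock runs at the 150 MHz stated in the abstract, which is what the millisecond timings imply; the variable and function names are hypothetical.

```python
FREQ_HZ = 150e6                                          # clock frequency from the abstract
CYCLES_K = {160: 186, 192: 268, 224: 364, 256: 475}      # thousands of clock cycles (assumed unit)
REPORTED_MS = {160: 1.24, 192: 1.78, 224: 2.42, 256: 3.16}


def sm_time_ms(cycles: float, freq_hz: float) -> float:
    """Execution time in milliseconds for a given cycle count and clock frequency."""
    return cycles / freq_hz * 1e3


# For example, 186,000 cycles at 150 MHz is about 1.24 ms.
computed_ms = {bits: sm_time_ms(k * 1000, FREQ_HZ) for bits, k in CYCLES_K.items()}
```

Under these assumptions, each computed time agrees with the reported value to within 0.01 ms.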

The authors of [10, 11] use the IMM algorithm and the binary inversion algorithm (BIA) to realize the multiplication and inversion units. The processor in [10] and the proposed processor take the same approach, that is, they use the same unit for the MM and MI operations. The proposed design, however, is 1.28–1.29 times faster than that of [10] over the 160/192/224/256 bit prime fields. Over the 160 bit prime field, the processor in [10] takes 35.43 k gates and 1.60 ms to perform an SM operation. The area-time (AT) product of the proposed processor is lower, which indicates a better balance between hardware consumption and performance.

The processor in [11] uses two adder-based inversion units and two adder-based multiplier units, whereas our processor uses one combined unit and therefore consumes less hardware. Our design saves 64.81%, 64.87%, 65.66%, and 64.69% of the area over the 160/192/224/256 bit prime fields, respectively, compared with the design in [11]. Over the 160 bit prime field, for example, the processor in [11] takes 101.3 k gates and 0.87 ms. Although that processor is faster, the proposed design achieves a lower AT value and thus a better balance between hardware consumption and performance. In summary, the proposed processor offers low hardware consumption and high hardware efficiency.
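The 160 bit comparison can be reproduced from the figures quoted in the text. The sketch below computes the area saving, the speed ratio, and the area-time product, taking AT as gate area (in k gates) multiplied by execution time (in ms), which is an assumed but conventional definition; the dictionary keys and function names are hypothetical.

```python
# 160 bit figures quoted in the text: (gate area in k gates, time in ms).
DESIGNS_160 = {
    "proposed": (35.65, 1.24),
    "ref10": (35.43, 1.60),
    "ref11": (101.3, 0.87),
}


def area_saving_pct(ours_k: float, other_k: float) -> float:
    """Percentage of area saved relative to another design."""
    return (1.0 - ours_k / other_k) * 100.0


def at_product(area_k: float, time_ms: float) -> float:
    """Area-time product in (k gates) x ms; lower is better."""
    return area_k * time_ms


saving_vs_ref11 = area_saving_pct(DESIGNS_160["proposed"][0], DESIGNS_160["ref11"][0])
speedup_vs_ref10 = DESIGNS_160["ref10"][1] / DESIGNS_160["proposed"][1]
at_values = {name: at_product(*spec) for name, spec in DESIGNS_160.items()}
```

With these inputs the saving relative to [11] rounds to 64.81%, the speed ratio relative to [10] is about 1.29, and the proposed design has the smallest AT product of the three.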

The processor in [13] uses a word-based Montgomery multiplier and a dynamic redundant binary converter, which improve the performance of SM. Compared with the design in [13], our design saves 69.66%, 63.35%, 58.93%, and 50.84% of the area over the 160/192/224/256 bit prime fields, respectively.

The processor in [14] has a large power consumption, which makes it unsuitable for IoT devices. More specifically, its full-size 256 bit × 256 bit multiplier requires a large amount of hardware, namely, 659 k gates. In contrast, the proposed design saves 91.03% of the area. The processor in [15] uses a systolic arithmetic unit running at a high frequency of 556 MHz. Over the 256 bit prime field, our design saves 51.52% of the area.

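Compared with the abovementioned processors in [10, 11, 13–15], our proposed processor has the least hardware consumption in nearly all cases; over the 160 bit prime field, the design in [10] is of comparable size (35.43 k versus 35.65 k gates).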

5. Conclusion

A hardware-efficient elliptic curve processor over GF(p) is proposed by optimizing, from the bottom up, every operation of scalar multiplication at the algorithm level on the basis of two full-word adders. The IMM and BMI algorithms are modified to suit the two adder units, and the registers are also optimized: a total of 12 full-word register units are used to store data. Synthesized on a 0.13 µm ASIC platform, the processor's hardware consumption stays within the range of 35.65 k–59.14 k gates, which is far lower than that of most processors.

Data Availability

The raw/processed data required to reproduce these findings cannot be shared at this time as the data also form part of an ongoing study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Science and Technology Planning Project of Guangdong Province of China (nos. 2015B010128013, 2015B090911001, and 2015B090912001).