Abstract
This paper proposes a hardware-efficient elliptic curve cryptography (ECC) architecture over GF(p) that uses adders to perform scalar multiplication (SM) through hardware reuse. At the algorithm level, the interleaved modular multiplication (IMM) algorithm and the binary modular inverse (BMI) algorithm are improved so that both need only two adders. Besides the adders, the data registers are a second optimization target. The design is synthesized with Design Compiler on a 0.13 µm CMOS ASIC platform. At 150 MHz, scalar multiplication over the 160-, 192-, 224-, and 256-bit field orders takes 1.99–3.17 ms. The gate area required for the different field orders ranges from 35.65k to 59.14k gates, which is 50%–91% less hardware than other processors.
1. Introduction
With the rapid development of technology, Internet of Things (IoT) devices have become widespread, and their security must be guaranteed. Beyond IoT devices, the safety of road networks also deserves close attention [1]. Elliptic curve cryptography (ECC) is an asymmetric cryptosystem proposed independently by Miller [2] and Koblitz [3] in the mid-1980s; it offers higher security than methods such as the RSA encryption algorithm at the same key size. Several international organizations have adopted ECC, including NIST [4], ANSI [5], and IEEE [6].
A large number of hardware architectures have been proposed for ECC [7–17]. They realize modular multiplication (MM) in one of two ways: with a multiplier or with an adder. Multiplier-based architectures include designs for a specific prime field and designs based on the Montgomery multiplication algorithm [7]. Adder-based architectures include designs based on the interleaved multiplication algorithm [9]. The processor in [13] uses the Montgomery MM algorithm with an r-bit × r-bit multiplier. The processors in [8, 14] use an n-bit × n-bit multiplier, where MM consists of a multiplication followed by a fast reduction over a specific prime field. Note that multiplier-based architectures require a large amount of hardware.
Modular inversion (MI) is another costly operation in ECC. Hardware-efficient architectures usually use binary modular inversion algorithms. The MM and MI units of the processor in [11] are adder-based, and each unit has its own adders. The processors in [18, 19] adopt a radix-4 Booth-encoded IMM algorithm. The processor in [20] implements MM with a radix-2 MM algorithm and avoids MI by using projective coordinates.
Traditional software implementations of cryptographic algorithms suffer from high power consumption and long delays, which hardware implementation can address. This article aims to provide security for IoT devices at low power consumption through hardware implementation. The main contributions of this article are as follows:
(1) A hardware-efficient architecture based on adder units is proposed to keep hardware consumption as low as possible.
(2) The IMM and BMI algorithms are modified so that MM and MI can be realized with two full-word adders and four data registers.
(3) Registers are optimized to minimize hardware consumption: four full-word register units serve MM, modular subtraction (MS), modular addition (MA), and MI, and eight full-word register units serve the SM operation.
The structure of this article is divided into four parts. First, the Mathematical Background section elaborates on EC operation and SM operation. Second, the Scalar Multiplication Architecture section introduces the hardwareefficient architecture over GF(p). Third, the Implementation and Result section shows the results and then conducts comparative analysis. Fourth, the Conclusion section is a summary.
2. Mathematical Background
2.1. Elliptic Curve over GF(p)
This section introduces elliptic curves (EC) over GF(p). For a prime p greater than 3, a non-supersingular elliptic curve E over GF(p) is defined by
$$y^2 = x^3 + ax + b \pmod{p}, \tag{1}$$
where a, b, x, and y are elements of GF(p), and $4a^3 + 27b^2 \not\equiv 0 \pmod{p}$. See [21, 22] for more information on elliptic curve cryptographic primitives.
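Formulas (2) and (3) give the point addition (PA) and point doubling (PD) operations. For elliptic curve points $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ with $P_1 \neq \pm P_2$, the formula for PA, $(x_3, y_3) = P_1 + P_2$, is
$$\lambda = \frac{y_2 - y_1}{x_2 - x_1}, \quad x_3 = \lambda^2 - x_1 - x_2, \quad y_3 = \lambda (x_1 - x_3) - y_1, \tag{2}$$
and the formula for PD, $(x_3, y_3) = 2P_1$, is
$$\lambda = \frac{3x_1^2 + a}{2y_1}, \quad x_3 = \lambda^2 - 2x_1, \quad y_3 = \lambda (x_1 - x_3) - y_1, \tag{3}$$
where all operations are carried out in GF(p).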
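The PA and PD formulas can be checked on a small example. The following Python sketch uses the standard affine formulas for a short Weierstrass curve; the toy curve y^2 = x^3 + 2x + 2 over GF(17), the point (5, 1), and all function names are assumptions chosen for illustration and do not come from the proposed design. The point at infinity and the case of equal x-coordinates are deliberately left out.

```python
# Toy curve, assumed for illustration only: y^2 = x^3 + 2x + 2 over GF(17).
P_MOD = 17
A_COEF = 2
B_COEF = 2


def inv_mod(v, p=P_MOD):
    """Inverse in GF(p) via Fermat's little theorem (p must be prime)."""
    return pow(v % p, p - 2, p)


def point_add(p1, p2, p=P_MOD):
    """PA for two finite points with different x-coordinates."""
    (x1, y1), (x2, y2) = p1, p2
    lam = (y2 - y1) * inv_mod(x2 - x1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return x3, y3


def point_double(p1, a=A_COEF, p=P_MOD):
    """PD for a finite point whose y-coordinate is nonzero."""
    x1, y1 = p1
    lam = (3 * x1 * x1 + a) * inv_mod(2 * y1, p) % p
    x3 = (lam * lam - 2 * x1) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return x3, y3


def on_curve(pt, a=A_COEF, b=B_COEF, p=P_MOD):
    """Check that a finite point satisfies the curve equation."""
    x, y = pt
    return (y * y - (x ** 3 + a * x + b)) % p == 0


G = (5, 1)
G2 = point_double(G)   # (6, 3)
G3 = point_add(G, G2)  # (10, 6)
print(G2, G3, on_curve(G2), on_curve(G3))
```

Each PA or PD in affine coordinates needs one field inversion, which is why the cost of the MI operation matters for the overall SM latency.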
2.2. Elliptic Curve Scalar Multiplication
SM is the basic operation in ECC. The SM operation takes an integer k and a point P on the elliptic curve as input and is performed as a sequence of PA and PD operations, as given in Algorithm 1. In Step 1, the point Q is initialized to the point at infinity. Step 2 iterates over the bits of k, and each iteration performs a PD operation; a bit $k_i = 1$ indicates that a PA operation is also performed.
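The control flow described above can be sketched in Python as a double-and-add loop. The sketch assumes a left-to-right (most significant bit first) scan, which matches a Q initialized to the identity and a PD in every iteration; the function name and the callable parameters are hypothetical. To keep the example self-contained, the group operations are passed in as functions and the demonstration uses integers under addition modulo 1009 as a stand-in for curve points, so that k·P can be checked directly.

```python
def scalar_multiply(k, point, add, double, identity):
    """Left-to-right double-and-add.

    Returns (k * point, number of PD operations, number of PA operations).
    `add`, `double`, and `identity` describe the group; for ECC they would
    be PA, PD, and the point at infinity.
    """
    q = identity                      # Step 1: Q starts at the identity
    n_pd = n_pa = 0
    for i in reversed(range(k.bit_length())):   # Step 2: scan bits of k
        q = double(q)                 # every iteration performs a PD
        n_pd += 1
        if (k >> i) & 1:              # bit k_i = 1 adds a PA
            q = add(q, point)
            n_pa += 1
    return q, n_pd, n_pa


# Stand-in group (assumption for the demo): integers under addition mod 1009.
N = 1009
result, n_pd, n_pa = scalar_multiply(
    45, 7,
    add=lambda a, b: (a + b) % N,
    double=lambda a: (2 * a) % N,
    identity=0,
)
print(result, n_pd, n_pa)
```

For k = 45 (binary 101101) the loop performs six PD and four PA operations, so the PD count follows the bit length of k and the PA count follows its Hamming weight.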

3. Scalar Multiplication Architecture
This section describes the bottom-up algorithm optimization over GF(p), which maximizes reuse of the adder units. The SM operation is implemented with two full-word adder units. Optimizing the MM and MI operations helps reduce power consumption and improve the performance of the SM operation.
3.1. Modular Addition/Subtraction
MA and MS operations are implemented according to Algorithm 2. In an ASIC, addition and subtraction can be implemented with nearly identical hardware, namely adder units. Since an MA or MS operation must complete in one clock cycle, 2 full-word adders are needed. Here the adder unit is the smallest building block, and the flags tested in Algorithm 2 are most significant bits (MSBs).
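One common way to finish MA or MS in a single cycle with two adders is to compute the raw result and the corrected result and then select one of them by a sign bit. The Python sketch below illustrates that idea under the assumption that both operands are already in [0, p − 1]; it is not a transcription of Algorithm 2, and the function names are hypothetical.

```python
def mod_add(a, b, p):
    """(a + b) mod p for 0 <= a, b < p using two additions."""
    s = a + b          # adder 1: raw sum, in [0, 2p - 2]
    t = s - p          # adder 2: sum with p subtracted
    return s if t < 0 else t   # sign (MSB) of t selects the result


def mod_sub(a, b, p):
    """(a - b) mod p for 0 <= a, b < p using two additions."""
    d = a - b          # adder 1: raw difference, in [-(p - 1), p - 1]
    t = d + p          # adder 2: difference with p added back
    return t if d < 0 else d   # sign (MSB) of d selects the result


print(mod_add(90, 20, 97), mod_sub(20, 90, 97))
```

In hardware the final selection is a multiplexer driven by the sign bit, so no third adder is needed.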

3.2. Modular Multiplication
MM is an indispensable operation in the SM architecture. This study adopts the interleaved modular multiplication algorithm. The standard interleaved modular multiplication in [16] (Algorithm 2) has shortcomings: steps 5, 6, and 7 perform additions with carry propagation, and steps 6 and 7 compare operands over their full length, so the latency is large. To address this, the improved algorithm in [16] (Algorithm 3) performs the additions inside the loop with carry-save adders, and the modified algorithm in [16] (Algorithm 4) further reduces area and time with a lookup-table method. In [10], a new interleaved modular multiplication algorithm that uses only two adder units is proposed; its steps are shown in Algorithm 5. In step 1, the variable R is initialized to zero. In step 2.1, R ∗ 2 is realized by a shift. In step 2.2, X_{i} ∗ Y is implemented by a multiplexer. Steps 2.3 and 2.4 each require one adder unit. Therefore, if each iteration completes within one clock cycle, a total of two adder units are required. After the iterations of step 2, the result lies in [0, 2p − 1], so step 3 is needed to bring R into [0, p − 1].
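For reference, the shift, select, and add structure of interleaved modular multiplication can be sketched in Python. This is the textbook form, which fully reduces R in every iteration and therefore uses up to three additions per bit; it is not the two-adder variant of [10], which defers part of the reduction to its final step. The function name and the `nbits` parameter are assumptions for illustration.

```python
def interleaved_mod_mul(x, y, p, nbits):
    """Return (x * y) mod p for 0 <= x, y < p, scanning x from its MSB.

    Textbook interleaved multiplication: reduction is interleaved with
    the shift-and-add steps so intermediate values stay below 3p.
    """
    r = 0                          # R is initialized to zero
    for i in reversed(range(nbits)):
        r <<= 1                    # R * 2 is a left shift
        if (x >> i) & 1:           # X_i * Y is a select between 0 and Y
            r += y
        # Here r <= 3p - 3, so two conditional subtractions reduce it.
        if r >= p:
            r -= p
        if r >= p:
            r -= p
    return r


# NIST P-192 prime, used here only as a convenient test modulus.
P192 = 2**192 - 2**64 - 1
x = 0x123456789ABCDEF0123456789ABCDEF0123456789ABCDEF
y = 0xFEDCBA9876543210FEDCBA9876543210FEDCBA987654321
print(interleaved_mod_mul(x, y, P192, 192) == (x * y) % P192)
```

The loop runs once per bit of the operand, which is why the cycle count of MM grows linearly with the field size when one iteration takes one clock cycle.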



3.3. Modular Inversion
Besides the MM operation, the modular inversion (MI) operation also plays an extremely important role in the SM architecture. The MI operation reuses the same two adder units to reduce hardware consumption. This paper adopts the binary modular inversion algorithm proposed in [10]. Algorithm 3 can calculate MM and MI operations in the same clock cycle. If the input a = 1, the operation is an MI operation and the result is an inverse modulo p. In step 1, the variables u, v, r, and s are initialized. In step 2 and step 3, the /2 operations are realized by shifting right by one bit. For a positive or negative odd r, $r/2 \equiv (r + p)/2 \pmod{p}$; that is, it can be computed by adding p to r and then shifting right. The same holds in the other cases. These operations require one adder unit. In step 2 or step 3, the comparison between u and v needed in step 4 is calculated in advance, which requires two adder units. In step 4, u − v or v − u is calculated, which requires two adder units. Step 5 requires a total of two adder units: one determines whether r or s is less than 0, and the other computes r + p or s + p. Therefore, if each step completes in one clock cycle, two adder units are required.
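The halving and subtraction steps described above follow the pattern of the binary extended Euclidean algorithm. The Python sketch below is the textbook binary inversion for an odd prime p, given as an illustration of that pattern rather than as a transcription of Algorithm 3; the function name is hypothetical, and the variable names u, v, r, and s mirror the text.

```python
def binary_mod_inverse(b, p):
    """Return b^-1 mod p for an odd prime p and 0 < b < p.

    Invariants: r * b = u (mod p) and s * b = v (mod p).
    """
    u, v = b, p
    r, s = 1, 0
    while u != 1 and v != 1:
        while u % 2 == 0:
            u >>= 1
            # r/2 mod p: shift if even, otherwise add p first, then shift.
            r = r >> 1 if r % 2 == 0 else (r + p) >> 1
        while v % 2 == 0:
            v >>= 1
            s = s >> 1 if s % 2 == 0 else (s + p) >> 1
        if u >= v:
            u, r = u - v, r - s
        else:
            v, s = v - u, s - r
    # r or s may be negative here; the final reduction brings it into [0, p - 1].
    return (r if u == 1 else s) % p


# NIST P-256 prime, used here only as a convenient test modulus.
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
b = 0xDEADBEEFCAFEBABE0123456789ABCDEF
inv = binary_mod_inverse(b, P256)
print((b * inv) % P256 == 1)
```

Only shifts, additions, and subtractions appear in the loop, which is what allows the inversion to share adder units with the multiplication.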
3.4. Point Addition and Point Doubling
Algorithm 4 gives the PA and PD operations. Because the modular operations (MA, MS, MM, and MI) share the same two adder units, only one modular operation is computed at a time. A total of eight registers are required: six are used by the PA and PD operations, and two hold the integer k and the prime p.
3.5. Scalar Multiplier Architecture
Figure 1 shows the SM architecture over GF(p), which implements the modular operations MM, MS, MA, and MI as well as the point operations SM, PA, and PD. The point controller block is the main state machine that realizes the point operations, and the modular controller block is the state machine that realizes the modular operations.
4. Implementation and Result
The ECC architecture described in this section is designed in Verilog HDL and synthesized with Design Compiler using the SMIC 130 nm CMOS standard cell library. Circuit area is reported in equivalent two-input NAND gates.
The simulation parameters are taken from the FIPS 186-2 standard [8]. Figure 2 lists the main parameters of one 256-bit elliptic curve over the prime field GF(p); curves of other bit lengths can be found in the FIPS 186-2 standard. The coordinates of the base point G on the elliptic curve are Gx and Gy.
The proposed architecture needs a total of two adders and twelve data registers. According to Table 1, the registers and adders account for 42% of the hardware. The twelve registers are used for data storage. As the field order increases, the adders' share of the resources rises from 13.72% to 15.54%.
Table 2 shows the implementation results of the proposed architecture and compares them with other designs. Averaged over 100 tests, the SM operation requires 186 k, 268 k, 364 k, and 475 k clock cycles over the 160-, 192-, 224-, and 256-bit prime fields, respectively. One SM operation takes 1.24, 1.78, 2.42, and 3.16 ms with a gate area of 35.65 k, 43.25 k, 49.41 k, and 59.14 k over the 160-, 192-, 224-, and 256-bit prime fields, respectively.
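The relation between cycle count and latency can be checked with a few lines of Python. The sketch assumes the 150 MHz clock stated in the abstract and reads the reported cycle counts as thousands of cycles, an assumption under which they agree with the reported times to within 0.01 ms; the variable names are illustrative.

```python
CLOCK_HZ = 150e6  # clock frequency stated in the abstract

# Cycle counts per SM, read as thousands of cycles (assumption).
cycles = {160: 186_000, 192: 268_000, 224: 364_000, 256: 475_000}

# Reported SM times in milliseconds.
reported_ms = {160: 1.24, 192: 1.78, 224: 2.42, 256: 3.16}

# latency = cycles / frequency, converted to milliseconds
latency_ms = {bits: n / CLOCK_HZ * 1e3 for bits, n in cycles.items()}

for bits in sorted(cycles):
    print(bits, round(latency_ms[bits], 3), reported_ms[bits])
```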
The authors of [10, 11] use the IMM algorithm and the BMI algorithm to realize the inversion and multiplier units. The processor in [10] and the proposed processor take the same approach, namely, using the same unit to implement both MM and MI operations. However, the proposed design achieves higher performance over the 160/192/224/256-bit prime fields and is 1.28–1.29 times faster than that of [10]. Over the 160-bit prime field, the processor in [10] takes 35.43 k gate area and 1.60 ms to perform an SM operation. In terms of the area-time product (AT), the AT value of the proposed processor is lower, indicating a better balance between hardware consumption and performance.
The processor in [11] uses two adder-based inversion units and two adder-based multiplier units, whereas our processor uses one combined unit. The proposed processor therefore has the advantage of low hardware consumption. Compared with the design in [11], our design saves 64.81%, 64.87%, 65.66%, and 64.69% of the area over the 160/192/224/256-bit prime fields. Taking the 160-bit prime field as an example, the processor in [11] takes 101.3 k gate area and 0.87 ms. Although it has higher performance, the proposed design achieves a lower AT value and thus a better balance between hardware consumption and performance. In summary, the proposed processor has the advantages of low hardware consumption and high hardware efficiency.
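The 160-bit comparison can be reproduced from the figures quoted in the text. The Python sketch below computes the area-time product and the area saving relative to [11]; the dictionary layout and names are illustrative, and only the gate areas and times stated above are used.

```python
# 160-bit figures quoted in the text: (gate area in k gates, SM time in ms)
designs = {
    "proposed": (35.65, 1.24),
    "[10]": (35.43, 1.60),
    "[11]": (101.3, 0.87),
}

# Area-time product (k gates x ms); lower means a better area/speed balance.
at_product = {name: area * time for name, (area, time) in designs.items()}

# Area saved by the proposed design relative to [11], in percent.
saving_vs_11 = (1 - designs["proposed"][0] / designs["[11]"][0]) * 100

# Speedup of the proposed design over [10].
speedup_vs_10 = designs["[10]"][1] / designs["proposed"][1]

for name, at in at_product.items():
    print(name, round(at, 2))
print(round(saving_vs_11, 2), round(speedup_vs_10, 2))
```

With these inputs the proposed design has the smallest AT value of the three, and the computed area saving and speedup match the 64.81% and 1.29 figures given in the text.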
The processor in [13] uses a word-based Montgomery multiplier and a dynamic redundant binary converter, which improve the performance of SM. Compared with the design in [13], our design saves 69.66%, 63.35%, 58.93%, and 50.84% of the area over the 160/192/224/256-bit prime fields.
The processor in [14] has high power consumption, which makes it unsuitable for IoT devices. More specifically, its full-size 256-bit × 256-bit multiplier requires a large amount of hardware, namely, 659 k gates. In contrast, the proposed design saves 91.03% of the area. The processor in [15] uses a systolic arithmetic unit at a high frequency of 556 MHz. Over the 256-bit prime field, our design saves 51.52% of the area.
Compared with the above-mentioned processors in [10, 11, 13–15], our proposed processor has the least hardware consumption.
5. Conclusion
By optimizing all algorithm-level operations of scalar multiplication from the bottom up on the basis of two full-word adders, a hardware-efficient elliptic curve processor over GF(p) is proposed. The IMM and BMI algorithms are improved so that they fit two adder units. The registers are also optimized: a total of 12 full-word register units are used to store data. Synthesized on a 0.13 µm ASIC platform, the processor's hardware consumption stays within the range of 35.65 k–59.14 k gates, which is far lower than that of most processors.
Data Availability
The raw/processed data required to reproduce these findings cannot be shared at this time as the data also form part of an ongoing study.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the Science and Technology Planning Project of Guangdong Province of China (nos. 2015B010128013, 2015B090911001, and 2015B090912001).