Abstract
The high computational complexity of elliptic curve scalar point multiplication limits its implementation on general-purpose processors. Dedicated hardware architectures are essential to reduce the computation time, which results in a substantial increase in the performance of the associated cryptographic protocols. This paper presents a unified architecture to compute modular addition, subtraction, and multiplication operations over a finite field of large prime characteristic, $GF(p)$. Subsequently, two instances of the unified architecture are used in the design of a high-speed elliptic curve scalar multiplier architecture. The proposed architecture is synthesized and implemented on several Xilinx FPGA platforms for different field sizes. The proposed design computes a 192-bit elliptic curve scalar multiplication in 2.3 ms on the Virtex-4 FPGA platform. It is 34% faster, requires 40% fewer clock cycles for elliptic curve scalar multiplication, and consumes considerably fewer FPGA slices than other existing designs. The proposed design is also resistant to timing and simple power analysis (SPA) attacks; therefore it is a good choice for the construction of fast and secure elliptic curve based cryptographic protocols.
1. Introduction
Elliptic curve cryptography (ECC), proposed independently by Miller [1] and Koblitz [2], has established itself as a proper alternative to traditional systems such as RSA (Rivest, Shamir, and Adleman) [3]. The National Institute of Standards and Technology (NIST) recommends a 256-bit key length for ECC to achieve the same level of security as 3072-bit RSA.
Because ECC offers similar security with considerably smaller key sizes than RSA, it has been standardized by IEEE and NIST [4]. The smaller key sizes lead to a substantial reduction in power consumption and storage requirements and offer potentially higher data rates. These inherent properties make ECC a strong candidate for providing security in resource-constrained devices. Unfortunately, due to the underlying complex mathematical structure, its implementation on general-purpose processors (GPPs) struggles to meet the speed requirements of many real-time applications.
Thus, several new implementation platforms have been explored in recent years. The field programmable gate array (FPGA) has been established as a proper platform for the implementation of security algorithms such as ECC and RSA. Its shorter design cycle time, lower design cost, and reconfigurability make it more attractive than other platforms, such as Application Specific Integrated Circuits (ASICs).
Elliptic curve scalar point multiplication is the central and most time-consuming operation in all ECC based schemes, so its efficient implementation on various platforms is critical. It is achieved by manipulating points on a properly chosen elliptic curve over a finite field. Mathematically, it is expressed as $Q = kP$, where $P$ is a base point, $k$ is an integer value, and $Q$ is the resultant point of the multiplication of $k$ and $P$. For example, it can be achieved by adding $P$ to itself $(k - 1)$ times. The strength of any ECC scheme is based on the computational hardness of finding $k$ given $P$ and $Q$, known as the Elliptic Curve Discrete Logarithm Problem (ECDLP).
There are several elliptic curve representations satisfying different performance and security requirements. A flexible design capable of supporting different values for the elliptic curve parameters and the prime $p$ is more demanding. Solving the ECDLP is not the only way of finding the scalar $k$; it can also be revealed by monitoring the timing [5] and power consumption of cryptographic devices, known as side channel attacks (SCAs) [6]. The simplest SCAs are based on timing and simple power analysis (SPA). Detailed surveys on known SCAs, countermeasures, and secure ECC implementations are reported in [7, 8].
Elliptic curve scalar point multiplication involves many basic modular arithmetic operations such as addition, subtraction, multiplication, inversion, and division. Hence, optimization of these operations can significantly improve the performance of ECC schemes.
Elliptic curve cryptosystems can be designed on a finite field either with prime characteristic, $GF(p)$, or with binary characteristic, $GF(2^m)$. The $GF(2^m)$ arithmetic is easier to implement in hardware than $GF(p)$ arithmetic because it is carry-free. However, field parameters in $GF(2^m)$ are mostly fixed and are not very flexible. Some efficient ECC implementations over $GF(2^m)$ are presented in [9–14]. A very good survey of high-speed hardware implementations of ECC has been reported in [15].
Several hardware based elliptic curve processors over $GF(p)$ have also been proposed in the literature [5, 16–26]. The design reported in [21] proposed two architectures to speed up the EC point multiplication operation. Both architectures incorporate parallel dedicated hardware units to compute arithmetic operations such as addition, subtraction, multiplication, and division over $GF(p)$. The multiplication unit of [21] is based on a bit-serial interleaved multiplication, while for division over $GF(p)$ a dedicated hardware unit based on a binary version of the extended Euclidean algorithm is used. Ghosh et al. proposed a speed and area optimized architecture for EC point multiplication by exploiting the concept of shared hardware arithmetic over $GF(p)$ [20]. The saving in area is achieved by sharing hardware resources among different arithmetic operations, while multiple copies of the arithmetic units are used to speed up EC point multiplication.
1.1. Contribution
Modern FPGAs have dedicated built-in arithmetic components (dedicated multipliers, block RAMs, etc.) to perform different signal processing tasks efficiently. However, in this work these components are not used because of the limitations of the technique adopted for modular multiplication, the Interleaved Multiplication (IM) algorithm [27], which interleaves the reduction step with the multiplication by reducing each partial product. To the best of the authors' knowledge, no work has been reported targeting a digit-wise implementation of the IM technique. However, the small-sized dedicated multipliers available inside an FPGA can be very effective for Montgomery multiplication [28] and for the NIST recommended primes [29]. With these methods, a modular multiplication can be performed as an integer multiplication followed by a modular reduction.
This paper presents a novel architecture to speed up EC point multiplication in affine coordinates. The proposed design is based on a unified adder, subtractor, and multiplier (Add/Sub/Mul) unit. The unified Add/Sub/Mul unit is an extension of our previous multiplier design reported in [30]. The proposed unified unit performs modular addition and subtraction in a single clock cycle, while modular multiplication takes multiple clock cycles. A careful FPGA implementation of the proposed EC point multiplication architecture outperforms other existing designs. The main advantages of the proposed design are as follows.
(i) It reduces the number of required clock cycles and the computation time of EC point multiplication by almost 40% and 34%, respectively, with considerably smaller FPGA area consumption. The reduction in clock cycles and computation time is mainly due to the proposed multiplier [30].
(ii) The adopted algorithm for EC point multiplication, with careful implementation of the arithmetic primitives, is capable of resisting timing and SPA attacks [5].
(iii) It is flexible; all parameters (curve parameter $a$, EC point $P$, scalar value $k$, and the prime value $p$) can be easily changed without FPGA reconfiguration.
This paper is organized as follows. Section 2 briefly explains EC group operations such as EC point addition and EC point doubling. In addition, this section describes the Montgomery ladder structure for the EC point multiplication algorithm. The unified Add/Sub/Mul unit over $GF(p)$ is presented in Section 3. Section 4 proposes a novel architecture for EC point multiplication based on the unified Add/Sub/Mul unit. Implementation results and performance evaluation are presented in Section 5, and finally the paper is concluded in Section 6.
2. Elliptic Curve Group Operations
In this paper, we consider an elliptic curve $E$ defined over a prime field $GF(p)$, where $p$ is a large prime. Field elements are represented as integers in the range $[0, p-1]$. An elliptic curve over $GF(p)$ in short Weierstrass form is represented as
$$y^2 = x^3 + ax + b, \tag{1}$$
where $a, b, x, y \in GF(p)$ and $4a^3 + 27b^2 \neq 0 \pmod{p}$. The set of all points that satisfy (1), plus the point at infinity, forms an abelian group. EC point addition and EC point doubling operations over such groups are used to construct many elliptic curve cryptosystems. The EC point addition and EC point doubling operations in affine coordinates can be represented as follows: let $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ be two points on the elliptic curve. The group operation is the point addition, $P_3 = P_1 + P_2 = (x_3, y_3)$, which is defined by the group law and is given as
$$x_3 = \lambda^2 - x_1 - x_2, \qquad y_3 = \lambda (x_1 - x_3) - y_1,$$
where
$$\lambda = \begin{cases} \dfrac{y_2 - y_1}{x_2 - x_1} & \text{if } P_1 \neq P_2, \\[2ex] \dfrac{3x_1^2 + a}{2y_1} & \text{if } P_1 = P_2. \end{cases}$$
If $P_1 = P_2$, then the special case of adding a point to itself is called the EC point doubling operation. In affine coordinates the EC point addition requires one division, two multiplications, and six addition or subtraction operations, whereas the EC point doubling requires one division, three multiplications, and seven addition or subtraction operations. Therefore, optimization of these operations has a significant impact on the overall performance of the EC point multiplication operation.
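To make the operation counts concrete, the affine group law can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware datapath of this paper: the function names `point_add` and `point_double`, the toy curve $y^2 = x^3 + 2x + 2$ over $GF(17)$, and the use of Python's built-in `pow(x, -1, p)` (Python 3.8 or later) for the modular division are all assumptions made for the example. The point at infinity and the case $x_1 = x_2$ in the addition are deliberately not handled.

```python
def point_add(P1, P2, p):
    """Add two affine points with x1 != x2 on y^2 = x^3 + ax + b over GF(p)."""
    (x1, y1), (x2, y2) = P1, P2
    # One division: lambda = (y2 - y1) / (x2 - x1) mod p
    lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    # Two multiplications: lambda^2 and lambda * (x1 - x3)
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


def point_double(P1, a, p):
    """Double an affine point with y1 != 0 on y^2 = x^3 + ax + b over GF(p)."""
    x1, y1 = P1
    # One division: lambda = (3*x1^2 + a) / (2*y1) mod p
    lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    # Three multiplications in total: x1^2, lambda^2, lambda * (x1 - x3)
    x3 = (lam * lam - 2 * x1) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


# Toy parameters (assumed for illustration only, far too small for real use)
p, a = 17, 2
G = (5, 1)
twoG = point_double(G, a, p)
print(twoG)                    # (6, 3)
print(point_add(G, twoG, p))   # (10, 6)
```

The single modular inversion per operation is what makes the divider the dominant cost in affine coordinates, which is why the architecture later pairs each Add/Sub/Mul unit with its own divider.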
2.1. Elliptic Curve Scalar Multiplication
EC cryptosystems are mostly based on the EC point multiplication operation. This operation can be performed as a sequence of EC point addition and EC point doubling operations, as given in Algorithm 1, which is known as the Montgomery ladder for EC point multiplication. Algorithm 1 works on the binary representation of $k$, and it is assumed that the most significant bit of $k$ is equal to 1. In every iteration one EC point addition and one EC point doubling are performed regardless of the value of the current bit of $k$, and the two operations are independent of each other, so they can be executed in parallel. Because the sequence of operations does not depend on the bit pattern of $k$, Algorithm 1 also provides protection against timing and simple power analysis (SPA) attacks.
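A minimal software sketch of the Montgomery ladder is given below to illustrate the control flow. It is not a transcription of Algorithm 1: the helper names `ec_add`, `ec_double`, and `montgomery_ladder`, as well as the toy curve over $GF(17)$, are assumptions for the example, and the point at infinity is not handled. Note that every iteration performs exactly one addition and one doubling; the scalar bit only selects which register is doubled and where the results are stored. A Python `if` is used for readability and is not itself constant-time.

```python
def ec_add(P1, P2, p):
    """Affine addition of two points with different x-coordinates."""
    (x1, y1), (x2, y2) = P1, P2
    lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def ec_double(P1, a, p):
    """Affine doubling of a point with non-zero y-coordinate."""
    x1, y1 = P1
    lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    x3 = (lam * lam - 2 * x1) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def montgomery_ladder(k, P, a, p):
    """Compute kP for k >= 1; the leading bit of k is 1 by construction."""
    bits = bin(k)[2:]
    R0, R1 = P, ec_double(P, a, p)        # state after the leading 1 bit
    for bit in bits[1:]:
        total = ec_add(R0, R1, p)         # always one point addition
        if bit == "1":
            R0, R1 = total, ec_double(R1, a, p)   # always one point doubling
        else:
            R0, R1 = ec_double(R0, a, p), total
    return R0


# Toy parameters (assumed): y^2 = x^3 + 2x + 2 over GF(17), base point G
p, a, G = 17, 2, (5, 1)
print(montgomery_ladder(3, G, a, p))   # (10, 6)
```

Throughout the loop the invariant $R_1 = R_0 + P$ holds, which is why the addition never receives two equal points as long as $k$ is smaller than the order of $P$.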

3. Unified Add/Sub/Mul Unit
In this section we present a unified modular adder, subtractor, and multiplier (unified Add/Sub/Mul) unit. This unit is capable of performing modular addition, subtraction, and multiplication operations and supports any prime $p$; therefore it is able to provide hardware support for ECC over a variety of elliptic curves. Normally, to achieve better performance of EC point multiplication on dedicated hardware, multiple copies of adder, subtractor, multiplier, and divider units are integrated. These multiple copies help to execute several operations in parallel at the expense of area and cost, and can also result in higher power consumption. Our objective is to accelerate the computation of the EC point multiplication operation with a minimum number of dedicated arithmetic units. Modular multiplication is a critical component in the architecture of the EC point multiplication operation. In this regard, several modular multipliers have been proposed. The design reported in [19] is based on an iterative addition and reduction algorithm: in every iteration, addition and reduction modulo $p$ of partial products are performed, so an $n$-bit modular multiplication takes multiple clock cycles. Two novel architectures based on radix-4 and radix-8 Booth encoding techniques are reported in [30, 31].
In [30], the clock cycle counts of a modular multiplication are reported for both the radix-4 and the radix-8 Booth encoded multipliers. The radix-8 Booth encoded multiplier given in Algorithm 2 is based on the technique of iterative addition and reduction modulo $p$ of partial products proposed by Blakley [27]. The two main components in the design are as follows:
(i) Three-bit left shift modulo $p$ unit (Step (3)).
(ii) Addition and subtraction modulo $p$ unit (Steps (7), (9), (11), and (13)).
There is also a logic circuit for Booth encoding in addition to these two core components. The presented unified Add/Sub/Mul unit is based on the same design. Because the radix-8 Booth encoded modular multiplier already contains a modular adder/subtractor unit, this paper modifies the multiplier design so that it can also perform modular addition and subtraction operations in addition to its main task, modular multiplication. With this modification, dedicated hardware units for modular addition and subtraction operations are not needed.
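The idea of radix-8 Booth encoded interleaved multiplication can be sketched in software as follows. This is a hedged illustration rather than a transcription of Algorithm 2 or of the design in [30]: the function names, the digit-generation loop, and the use of Python's `%` operator for the partial-product addition are assumptions (in hardware that step would be an add or subtract of a precomputed multiple followed by a conditional correction). The sketch assumes $0 \le a, b < p < 2^n$.

```python
def shift_left_mod(r, p):
    """Single-bit left shift modulo p for 0 <= r < p (one adder plus a multiplexer)."""
    r <<= 1
    return r - p if r >= p else r


def booth8_digits(b, n):
    """Radix-8 Booth digits of an n-bit b, each in -4..4, least significant first."""
    digits, prev = [], 0                  # prev models the bit b[3i-1], with b[-1] = 0
    for i in range(n // 3 + 1):           # enough digits so that the top bit reads as 0
        chunk = (b >> (3 * i)) & 0b111    # bits b[3i+2], b[3i+1], b[3i]
        digits.append(prev + (chunk & 0b11) - 4 * (chunk >> 2))
        prev = chunk >> 2
    return digits


def modmul_booth8(a, b, p, n):
    """Interleaved modular multiplication a*b mod p, three multiplier bits per iteration."""
    r = 0
    for d in reversed(booth8_digits(b, n)):   # most significant digit first
        for _ in range(3):                    # three-bit left shift modulo p
            r = shift_left_mod(r, p)
        r = (r + d * a) % p                   # add or subtract a multiple of a, then reduce
    return r


# Example with the NIST P-192 prime; the operand values are arbitrary
p192 = 2**192 - 2**64 - 1
x, y = 0x1234567890ABCDEF, 0xFEDCBA0987654321 << 100
print(modmul_booth8(x, y, p192, 192) == (x * y) % p192)   # True
```

Because the intermediate result is reduced in every iteration, it never grows beyond the size of the modulus, which is the property that lets the same adder/subtractor be reused for stand-alone modular addition and subtraction.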

The top-level block of the unified Add/Sub/Mul unit is shown in Figure 1. The logic components of the radix-8 Booth encoded modular multiplier are divided into shared and unshared parts. The shared logic components are used for modular addition, subtraction, and multiplication operations, whereas the unshared logic components are dedicated to the modular multiplication operation only. A control unit decodes instructions on the basis of a two-bit operational code (opcode) and generates the appropriate signals for the shared and unshared logic parts.
The shared logic comprises a modular adder/subtractor unit, while the unshared logic consists of a three-bit left shift modulo $p$ unit and the Booth encoding logic. The adder/subtractor and three-bit left shift modulo $p$ units are shown in Figure 2. The three-bit left shift modulo $p$ unit comprises three identical D1 units cascaded in series. Each D1 unit performs a single-bit left shift modulo $p$ operation and consists of one $n$-bit adder and a multiplexer. Hence, in total, the unshared logic consists of three $n$-bit adders, three multiplexers, and additional logic for Booth recoding. The adder/subtractor unit consists of two $n$-bit adders and five multiplexers.
In the proposed unified Add/Sub/Mul unit, the adder/subtractor resources are shared among the three operations, so two $n$-bit adders and five multiplexers are saved. This unit is not capable of performing modular addition, subtraction, and multiplication operations in parallel. However, EC point representation in affine coordinates offers very limited scope for parallelism. Therefore, the proposed unified Add/Sub/Mul unit can increase the performance of EC point multiplication in affine coordinates with a lower area overhead. The proposed unified Add/Sub/Mul unit performs the modular addition, subtraction, and multiplication operations listed in Table 1 in the following manner.
A modular addition is performed by the shared logic unit if the two-bit input opcode = 00. The control unit decodes the opcode, activates the shared logic block (that is, the adder/subtractor unit), and sets its control signal accordingly. The adder/subtractor unit consists of two $n$-bit adders and logic for input/output multiplexing, as shown in Figure 2. The first $n$-bit adder performs the addition of the operands, and the result is fed into the second $n$-bit adder, where the modulus $p$ is subtracted from it. Similarly, a modular subtraction is performed by the same unit by setting opcode = 01; the first $n$-bit adder performs the subtraction, followed by the addition of the modulus $p$. The result of a modular addition or subtraction becomes available at the output port after a single clock cycle. In the case of a multiplication, indicated by opcode = 10, the control unit generates the appropriate signals for both the shared and unshared logic components. The addition or subtraction of partial products is computed by the shared logic components depending on the cin signal generated by the Booth recoding logic, while the three-bit left shift modulo $p$ operation is computed by the unshared logic components. The detailed execution procedure and control signals for both shared and unshared logic components are given in [30]. The unified Add/Sub/Mul architecture takes multiple clock cycles to produce a multiplication result at the output port. The main advantage of the proposed unified Add/Sub/Mul unit is that a single unit handles addition, subtraction, and multiplication instructions. It eliminates the need for dedicated hardware units for addition and subtraction operations, which would consume two $n$-bit adders in addition to I/O multiplexers. The proposed unit is not only optimized for hardware resources and for the number of clock cycles required for a multiplication operation, but it is also programmable and supports any value of the modulus $p$.
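The opcode-driven behaviour described above can be modelled in software as a rough sketch. The opcode values 00, 01, and 10 come from the text; everything else is an assumption for illustration: the names `add_sub`, `shift_left_mod`, and `unified_unit`, and in particular the multiplication path, which is simplified here to a bit-serial interleaved loop instead of the radix-8 Booth encoded datapath of the actual design. Operands are assumed to lie in $[0, p-1]$.

```python
OP_ADD, OP_SUB, OP_MUL = 0b00, 0b01, 0b10


def add_sub(a, b, p, subtract):
    """Shared logic: the first adder forms a +/- b, the second corrects by -/+ p."""
    first = a - b if subtract else a + b
    second = first + p if subtract else first - p
    # Output multiplexer: pick whichever candidate lies in [0, p-1]
    if subtract:
        return first if first >= 0 else second
    return second if second >= 0 else first


def shift_left_mod(r, p):
    """Unshared logic: single-bit left shift modulo p."""
    r <<= 1
    return r - p if r >= p else r


def unified_unit(opcode, a, b, p):
    """Dispatch one modular operation on operands in [0, p-1]."""
    if opcode == OP_ADD:
        return add_sub(a, b, p, subtract=False)
    if opcode == OP_SUB:
        return add_sub(a, b, p, subtract=True)
    if opcode == OP_MUL:
        r = 0
        for bit in bin(b)[2:]:                 # simplified: one multiplier bit per step
            r = shift_left_mod(r, p)           # unshared shift logic
            if bit == "1":
                r = add_sub(r, a, p, False)    # shared adder accumulates the partial product
        return r
    raise ValueError("unsupported opcode")


print(unified_unit(OP_ADD, 15, 9, 17))   # 7
print(unified_unit(OP_SUB, 3, 9, 17))    # 11
print(unified_unit(OP_MUL, 15, 9, 17))   # 16
```

The point of the sketch is the sharing: the multiplication path calls the same `add_sub` function that serves the stand-alone addition and subtraction opcodes, so no separate adder is needed for them.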
4. Elliptic Curve Scalar Multiplier Architecture
ECC based schemes heavily rely on the EC scalar multiplication operation; therefore, its efficient implementation can greatly improve the performance of associated ECC based protocols.
The EC scalar multiplication is the computation of $Q = kP$, where $k$ is an integer and $P$ is a base point of a chosen elliptic curve. Several algorithms have been proposed to compute the EC scalar multiplication operation [29]. The standard double-and-add, non-adjacent form (NAF), and Montgomery ladder methods are the most widely used. EC point addition and EC point doubling operations can be executed in parallel using the Montgomery ladder method given in Algorithm 1. Because the execution of these EC point operations does not depend on the respective scalar bit $k_i$, their power consumption is symmetric, and an attacker cannot extract information about the secret value $k$ from a single power trace. Therefore, this technique provides protection against simple power analysis attacks. This section presents an efficient architecture for EC scalar multiplication in affine coordinates based on the unified Add/Sub/Mul unit proposed in Section 3. The proposed EC scalar multiplier architecture executes a scalar multiplication as a sequence of EC point addition and EC point doubling operations. These EC point operations are in turn performed as a sequence of arithmetic operations, as given in Table 2.
The EC point addition operation requires six subtraction, two multiplication, and one division operations. On the other hand, three addition, four subtraction, three multiplication, and one division operations are required to perform the EC point doubling operation. As depicted in Table 2, the EC point operations in affine coordinates require a division operation in addition to the addition, subtraction, and multiplication operations. A division or inversion can be performed either by Fermat's little theorem or by the Extended Euclidean algorithm (EEA). The binary version of the EEA given in [29] is the most widely adopted algorithm for division. The EEA implementation in this work is based on the guidelines presented in [34]. It takes $2n$ clock cycles to perform an $n$-bit division or inversion operation.
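For illustration, a textbook form of the binary extended Euclidean algorithm for modular division is sketched below. It is not the implementation of [34] used in this work; the function name `mod_div` and its argument order are assumptions. The sketch relies only on halving, parity tests, and additions or subtractions, which is what makes the algorithm attractive in hardware, and it assumes an odd prime $p$ and a divisor $0 < x < p$. The `% p` after each subtraction stands in for a conditional addition of $p$.

```python
def mod_div(y, x, p):
    """Return y / x mod p for an odd prime p and 0 < x < p (binary EEA)."""
    u, v = x, p
    x1, x2 = y % p, 0
    # Invariants: x * x1 == u * y (mod p) and x * x2 == v * y (mod p)
    while u != 1 and v != 1:
        while u % 2 == 0:
            u //= 2
            x1 = x1 // 2 if x1 % 2 == 0 else (x1 + p) // 2   # halve x1 modulo p
        while v % 2 == 0:
            v //= 2
            x2 = x2 // 2 if x2 % 2 == 0 else (x2 + p) // 2   # halve x2 modulo p
        if u >= v:
            u, x1 = u - v, (x1 - x2) % p
        else:
            v, x2 = v - u, (x2 - x1) % p
    return x1 if u == 1 else x2


# Cross-check against inversion by Fermat's little theorem, x^(p-2) mod p
p192 = 2**192 - 2**64 - 1
num, den = 0x1234567890ABCDEF, 0x0FEDCBA987654321
print(mod_div(num, den, p192) == num * pow(den, p192 - 2, p192) % p192)   # True
```

Fermat's method needs on the order of $n$ modular multiplications for one inversion, whereas the binary EEA finishes in a number of shift-and-subtract steps proportional to $n$, which is consistent with the division cost quoted above being linear in the field size.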
It is evident from Table 2 that the scope for parallelism among the arithmetic operations in the computation of EC point operations is very limited. Therefore, a semi-parallel architecture for EC scalar multiplication is proposed, as shown in Figure 3.
It consists of two unified Add/Sub/Mul units, two divider units, two register files (each comprising three $n$-bit registers), I/O multiplexers, and a main controller. The unified Add/Sub/Mul unit executes one addition, subtraction, or multiplication operation at a time, while the division unit executes a modular division in $2n$ clock cycles. Therefore, the proposed design can execute two addition, subtraction, or multiplication operations in parallel with two division operations. We group these arithmetic units into two units, SAU1 and SAU2, each consisting of one unified Add/Sub/Mul unit, one divider unit, and one register file. The EC point addition operation and the EC point doubling operation in Algorithm 1 can be performed in parallel. Therefore, the proposed architecture performs these EC point operations in parallel; however, within each unified Add/Sub/Mul unit, addition, subtraction, and multiplication operations can only be performed serially. The SAU1 unit is dedicated to the EC point addition operation, while the EC point doubling operation is executed by the SAU2 unit. The register files store intermediate results during the execution of EC point addition and EC point doubling operations, based on control signals generated and managed by the main controller.
4.1. Scheduling Strategy
A scheduling policy to compute EC point addition and EC point doubling operations on the proposed SAU1 and SAU2 units is shown in Figure 4, where addition, subtraction, multiplication, and division operations are denoted as +, −, ×, and /, respectively. The coordinates of the two input points $P_1$, $P_2$ are denoted by $x_1$, $y_1$, $x_2$, $y_2$, while the resultant point coordinates are shown as $x_3$, $y_3$. The results of + and − operations are available after one clock cycle, whereas × and / operations take multiple clock cycles. The register transfer logic of EC point addition and EC point doubling operations on the SAU1 and SAU2 units can be analyzed using Figure 4 and Table 2. Initially, four registers are loaded with the coordinates of the EC input points $P_1$ and $P_2$, while a fifth register is initialized with the EC parameter $a$. EC point addition on the SAU1 unit and EC point doubling on the SAU2 unit each complete in a fixed number of clock cycles. Because the two units run concurrently, a single iteration of Algorithm 1 completes once both have finished, and the four coordinate registers are then updated with the new values of EC point addition and EC point doubling. The total number of clock cycles required to compute the EC point multiplication operation on the proposed architecture can then be estimated as the number of iterations of Algorithm 1 multiplied by the number of clock cycles per iteration.
4.2. Main Controller Logic
The main controller is shown in Figure 5; it is based on a finite state machine (FSM) comprising six states. The control unit is responsible for generating the control signals required to execute EC point addition and EC point doubling on the proposed SAU1 and SAU2 units according to the scheduling strategy shown in Figure 4. It waits for the respective done signals, checks the $i$th bit of the scalar $k$, and either updates the register files with new values or outputs the result and stops execution.
5. Implementation Results and Discussion
The elliptic curve scalar multiplier architecture presented in the previous section has been implemented in Verilog HDL. The Xilinx ISE 9.1 design suite was used for simulation, synthesis, mapping, and routing.
Table 3 shows the performance of the proposed architecture for 160-, 192-, 224-, and 256-bit field sizes on several FPGA platforms. The 192-bit implementation takes 3.2 ms, 2.3 ms, and 1.4 ms while running at a maximum frequency of 35 MHz, 48 MHz, and 81 MHz on the Virtex-II Pro, Virtex-4, and Virtex-6 FPGA platforms, respectively. Because the ISE 9.1 design suite does not support the Virtex-6 FPGA, the implementation on Virtex-6 was done using Xilinx ISE 14.7.
For the 192-bit field size, our implementation on Virtex-4 computes a single EC scalar multiplication in 2.3 ms, using 113,472 clock cycles at a maximum frequency of 48 MHz. The 192-bit implementation consumes 8,500 slices of the Virtex-4 FPGA and has a throughput of 83.5 Kbps. The same design on Virtex-II Pro takes 3.2 ms at a maximum frequency of 35 MHz and uses 7,930 slices. The proposed architecture is compared with other FPGA implementations on the basis of clock cycles, computation time, frequency, occupied FPGA slices, and throughput (TP).
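The relation between the figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are arbitrary, and the throughput definition (field size in bits divided by the time of one scalar multiplication) is an assumption that happens to reproduce the reported 83.5 Kbps.

```python
cycles = 113_472        # clock cycles for one 192-bit scalar multiplication (Virtex-4)
f_max_hz = 48e6         # maximum clock frequency on Virtex-4
field_bits = 192

# Computation time = clock cycles / clock frequency
time_ms = cycles / f_max_hz * 1e3
print(f"{time_ms:.3f} ms")              # 2.364 ms, quoted as 2.3 ms in the text

# Throughput (assumed definition): field bits per scalar multiplication time,
# evaluated with the quoted 2.3 ms
throughput_kbps = field_bits / 2.3e-3 / 1e3
print(f"{throughput_kbps:.1f} Kbps")    # 83.5 Kbps

# Average clock cycles spent per scalar bit
cycles_per_bit = cycles / field_bits
print(cycles_per_bit)                   # 591.0

# Speed-up over the 3.5 ms reported for [21] on the same platform
speedup_percent = (3.5 - 2.3) / 3.5 * 100
print(f"{speedup_percent:.0f}%")        # 34%
```

The same arithmetic applied to the 160-bit figures (79,200 clock cycles at 40 MHz) gives 1.98 ms and 495 clock cycles per scalar bit.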
Table 4 shows the required number of clock cycles to compute the EC scalar multiplication operation. Because the proposed design executes the EC point addition and EC point doubling operations concurrently, a single iteration of Algorithm 1 takes no longer than the slower of the two operations. The designs reported in [21] take almost 40% more clock cycles than the proposed design. Similarly, [18, 24–26] require 48%, 179%, 85%, and 62% more clock cycles to perform the EC scalar multiplication, respectively.
Table 5 presents a performance analysis of several existing FPGA based implementations of the EC scalar multiplier. The design reported in [21] is based on parallel dedicated hardware units for addition, subtraction, multiplication, and division. It computes a 192-bit EC scalar multiplication in 3.5 ms at a maximum frequency of 53 MHz on the Virtex-4 platform. On the same platform the proposed design is 34% faster and requires 39% fewer clock cycles with 40% lower FPGA slice consumption as compared to [21]. The proposed design completes a 160-bit EC point multiplication in 79,200 clock cycles at a maximum frequency of 40 MHz. It consumes 6,492 Virtex-II Pro FPGA slices. The embedded multicore design reported in [32] computes a 192-bit EC scalar multiplication in 9.9 ms running at a maximum frequency of 93 MHz and consumes 3,173 Virtex-II Pro FPGA slices. It also uses 6 block RAMs (BRAMs) and sixteen 18 × 18-bit embedded multipliers. Compared to our design, it is 210% slower; our design in turn uses 149% more FPGA slices (7,930 versus 3,173) if the BRAMs and dedicated embedded multipliers of [32] are ignored. The design presented in [33] consumes 15,775 slices and takes 5.99 ms to compute one EC scalar multiplication. On the same platform it is 25% faster but it consumes 28% more FPGA slices. The design proposed by Daly et al. in [18] is 262% slower but consumes 47% fewer slices. The design reported in [20] is 40% slower and consumes 13% more FPGA slices as compared to the proposed design.
Performance comparison on the basis of throughput rate is depicted in the last column of Table 5. The proposed design has 0.5 times, 2.64 times, 1.30 times, 2.1 times, and 0.42 times higher throughput rate as compared to the designs [21], [18], [16], [32], and [20], respectively. The design [33] has 1.25 times higher throughput rate as compared to our design; however, it consumes 1.42 times more FPGA slices. Therefore, taking computation time, slice area, and throughput rate together, our design compares favorably with the designs listed in Table 5. Because the proposed design executes the EC point addition and EC point doubling operations concurrently in a fixed amount of time, it provides protection against timing and simple power analysis attacks, which is an important feature in modern security applications. Due to the lower computation time and high throughput rate it is suitable for network applications such as SSL and IPsec. It is also suitable for low-power, resource-constrained environments because of the smaller area and reduced clock cycles.
6. Conclusion
This paper first introduces a unified arithmetic architecture for modular addition, subtraction, and multiplication operations. Then, a high-speed elliptic curve scalar multiplier is developed on the basis of the unified arithmetic architecture. The proposed design has been synthesized using Xilinx ISE 9.1 and 14.2 Design Suites targeting various Xilinx FPGA devices. Performance is shown for 160-, 192-, 224-, and 256-bit elliptic curve scalar multiplication operations. Compared with other contemporary designs, it gives 34% and 40% better performance in terms of computation time and number of required clock cycles, respectively. It is programmable for any value of the prime $p$ and is also resilient to timing and simple power analysis attacks. Therefore, it is a good choice for ECC based cryptosystems.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
This work is supported by the Telecommunication Graduate Initiative (TGI) scheme funded by the Higher Education Authority (HEA), Ireland.