Recent Advances in Security and Privacy for Wireless Sensor Networks 2020
Research Article | Open Access
Miguel Morales-Sandoval, Luis Armando Rodriguez Flores, Rene Cumplido, Jose Juan Garcia-Hernandez, Claudia Feregrino, Ignacio Algredo, "A Compact FPGA-Based Accelerator for Curve-Based Cryptography in Wireless Sensor Networks", Journal of Sensors, vol. 2021, Article ID 8860413, 13 pages, 2021. https://doi.org/10.1155/2021/8860413
A Compact FPGA-Based Accelerator for Curve-Based Cryptography in Wireless Sensor Networks
Abstract
The main topic of this paper is low-cost public key cryptography in wireless sensor nodes. Security in embedded systems, for example, in sensor nodes based on field programmable gate arrays (FPGAs), demands low-cost but still efficient solutions. Sensor nodes are key elements in the Internet of Things paradigm, and their security is a crucial requirement for critical applications in sectors such as military, health, and industry. To address these security requirements under the restrictions imposed by the available computing resources of sensor nodes, this paper presents a low-area FPGA-prototyped hardware accelerator for scalar multiplication, the most costly operation in elliptic curve cryptography (ECC). This crypto-engine is provided as an enabler of robust cryptography for security services in the IoT, such as confidentiality and authentication. The compactness of the proposed hardware design is achieved by implementing a novel digit-by-digit computing approach applied to the finite field and curve level algorithms, in addition to hardware reuse, the use of embedded memory blocks in modern FPGAs, and simpler control logic. Our hardware design targets elliptic curves defined over binary fields generated by trinomials, uses fewer area resources than other FPGA approaches, and is faster than software counterparts. Our ECC hardware accelerator was validated under a hardware/software codesign of the elliptic curve Diffie-Hellman key exchange protocol (ECDH) deployed on the IoT MicroZed FPGA board. For a scalar multiplication in the sect233 curve, our design requires 1170 FPGA slices and completes the computation in 128820 clock cycles (at 135.31 MHz), with an efficiency of 0.209 kbps/slice. In the codesign, the ECDH protocol is executed in 4.1 ms, 17 times faster than a MIRACL software implementation running on the embedded Cortex-A9 processor in the MicroZed.
The FPGA-based accelerator for binary ECC presented in this work requires the least amount of hardware resources among comparable FPGA designs in the literature.
1. Introduction
Nowadays, the computing paradigm of Internet of Things (IoT) is enabling a large number of applications in wireless technologies such as smart vehicles, smart buildings, health monitoring, energy management, environmental monitoring, food supply chains, and manufacturing [1].
In critical IoT applications, as in the Industrial Internet of Things (IIoT) or in healthcare (Medical Internet of Things, MIoT), embedded system devices have become an integral part [2] and easy targets of attacks, mainly because they are physically more accessible. Cyber-physical systems in these domains create new classes of risks resulting from the interaction between cyberspace and the physical world. Wireless sensor networks (WSNs) are the cornerstone of IoT applications, where in some cases the data generated, stored, or transmitted by the nodes (i.e., embedded systems) require robust security mechanisms that provide confidentiality, authentication, integrity, and non-repudiation. Consider the model for a set of networked IoT devices (for example, a wireless sensor network) in Figure 1. Security risks arise since a malicious node can gain unauthorized access to sensitive data, maliciously alter data, and impersonate legitimate nodes, thus posing threats to confidentiality and authentication in the communication path between a sender and a receiver node.
A robust approach to provide such security services in the IoT domain is public key cryptography (PKC). PKC in its different families is based on hard mathematical problems, and the underlying realizations involve costly arithmetic algorithms over finite fields, rings, or groups. In the literature, a vast amount of research has focused on hardware acceleration of PKC at the different levels of the arithmetic algorithms involved. The main approaches for hardware implementations of PKC have focused on speeding up the underlying group and finite field operations at the expense of a large amount of hardware resources. However, the main drawback of hardware for PKC in WSNs is the long key length, which leads to large chip area, circuit delays, and increased power dissipation [3].
A straightforward hardware implementation of PKC-based security solutions is not viable in the resource-constrained devices typically found in IoT scenarios, such as FPGA-based sensor nodes. Lightweight cryptography (LWC) [4] has emerged as an active research line focused on designing cryptographic primitives, schemes, and protocols tailored to constrained devices such as sensor nodes in WSNs or other IoT devices, for example, RFID tags [5]. For the case of PKC, elliptic curve cryptography (ECC) has been considered one of the most efficient realizations, well suited for constrained environments in the IoT [6].
Application-specific integrated circuits (ASICs) were the first targets in LWC [4, 7]. However, reconfigurable logic circuits, specifically field programmable gate arrays (FPGAs), are becoming more popular for implementing compact/low-area hardware accelerators for cryptographic algorithms, with attractive advantages for the IoT domain [8]. At the beginning, FPGAs were frequently used as devices for rapid prototyping of cryptographic algorithms, but now they are commonly used as final product platforms [9]. Furthermore, FPGAs are not only used as single parts of embedded systems but also as system-on-chip (SoC) platforms for implementing complete applications [10]. Modern commercial FPGA devices contain not only programmable hardware resources but also large functional blocks, such as high-speed multipliers, embedded multiport memories, and even programmable processor cores. This enables hardware/software codesigns where the critical parts of an algorithm, protocol, or application are accelerated with custom designs implemented in the available programmable hardware, and the rest of the application is executed by the general-purpose processors. The main advantage of FPGAs is reconfigurability since, for example, a whole system can be upgraded (or partially reconfigured) [7].
Recent works propose FPGAs as the most attractive candidates for a large range of IoT applications because of their high energy efficiency and low cost, for example, IoT machine learning [11], IoT neural networks [12], IoT vehicle monitoring systems [13], and IoT security (cryptography) [14], among other applications. Not only do research papers propose FPGAs as hardware modules for IoT scenarios, but FPGA vendors are also producing devices with specific features for IoT development [15].
Contribution: in this work, we pursue a low-area hardware engine for ECC-based IoT security, suitable for inclusion as a building block in FPGA-based sensor nodes for IIoT or MIoT. We aim at providing one of the most compact FPGA hardware accelerators for scalar multiplication in standard binary curves, which is the most time-consuming operation and the core of ECC cryptographic schemes such as encryption, digital signatures, and key establishment. To achieve compactness, a novel digit-digit binary finite field multiplier is proposed and used as the basic building block of the proposed ECC accelerator. Under this approach, the operands are processed one digit at a time in an iterative way, while exploiting the parallelism at the algorithmic level and reusing hardware resources as much as possible. The sequence of field operations in the scalar multiplication algorithm is carefully scheduled to reduce the number of field multiplier cores (two) and memory blocks (eight). While the field multipliers are implemented using standard FPGA logic, the memories are taken from the ones available in modern FPGAs. Due to the digit-digit computation approach, an efficient data memory management is designed to reduce the number of memory blocks. This way, with only eight memory blocks, the several field multiplications in a single point addition are correctly computed, and at the same time, those same memories keep track of the progress of the scalar multiplication. The novel hardware design presented in this work was validated under a hardware/software implementation of the elliptic curve Diffie-Hellman (ECDH) key exchange protocol, tailored to the MicroZed FPGA prototyping board, which is recommended for industrial IoT applications. Under this setting, which is very common in FPGA IoT applications, the execution of ECDH outperforms the software counterpart, implemented using the MIRACL library and running on the embedded Cortex-A9 processor in the MicroZed.
Compared with similar state-of-the-art approaches, our hardware architecture requires only up to 16% of the FPGA hardware resources they use, thus being the most compact FPGA-based hardware architecture for computing scalar multiplications in ECC defined over binary fields. Compared to the software reference implementation, our design is 17 times faster.
The rest of this paper is organized as follows: Materials and Methods discusses the preliminaries of scalar multiplication in binary elliptic curves and the Montgomery López-Dahab algorithm for scalar multiplication. This section also describes related works and the proposed hardware design. Results and Discussion presents the experimental results and comparisons with state-of-the-art works, followed by concluding remarks in the Conclusion.
2. Materials and Methods
This section first provides the mathematical concepts and foundations on which the FPGA-based ECC crypto-engine is built. We present the basics of elliptic curves and groups, from which scalar multiplication is defined. Scalar multiplication is critical because the proposed hardware crypto-engine is designed precisely to speed up this costly operation, which is the core of higher-level operations for security applications such as encryption and digital signatures. The section then discusses the method to compute scalar multiplications on binary elliptic curves, which is the algorithm realized by the proposed FPGA-based ECC crypto-engine.
2.1. Elliptic Curves and Their Use in Cryptography
Since it was invented independently by Miller [16] and Koblitz [17], elliptic curve cryptography (ECC) has received a lot of attention in academia and industry. Elliptic curves and their properties have also enabled other types of cryptography relevant for the IoT (in wireless sensor networks), for example, identity-based encryption (IBE) [18] and attribute-based encryption [19]. With the advent of the IoT, largely populated by intelligent objects with restricted computing capabilities and resources, ECC is becoming one of the promising approaches to provide security services in that computing paradigm [6].
An elliptic curve $E$ over a finite field $GF(2^m)$ is defined by Eq. (1):

$$E: y^2 + xy = x^3 + ax^2 + b, \qquad (1)$$

where $a, b \in GF(2^m)$ and $b \neq 0$. The pairs $(x, y)$ satisfying $E$, together with a special point named the point at infinity $\mathcal{O}$, form a group $G$ with point addition as the group operation. $G$ is a cyclic group with prime order $n$ where the discrete logarithm problem is defined and on which ECC is founded.
It is well known that binary extension fields ($GF(2^m)$) are very attractive for defining ECC. An element $A \in GF(2^m)$ is the bit vector $(a_{m-1}, \ldots, a_1, a_0)$ that in polynomial basis represents the degree-$(m-1)$ polynomial $A(x) = a_{m-1}x^{m-1} + \cdots + a_1 x + a_0$, with $a_i$ in {0,1}. Arithmetic in $GF(2^m)$ in polynomial basis is polynomial arithmetic with reduction modulo $F(x)$, which is an irreducible polynomial of degree $m$. The arithmetic in $GF(2^m)$ is carry-free and more suitable for hardware implementations.
2.2. Scalar Multiplication in Elliptic Curves
Scalar multiplication in the elliptic curve group, denoted as $kP$ with $k$ an integer and $P$ a point of the group, is the main and most time-consuming operation in any ECC scheme (encryption, digital signature, key exchange, etc.). $kP$ is defined as the sum of $k$ copies of $P$ [20]: $kP = P + P + \cdots + P$.
The complexity of $kP$ is $O(\log_2 k)$ in terms of the group operations. Given a large integer $k$ and a point $P$ in the group, it is easy to compute $Q = kP$. On the contrary, the elliptic curve discrete logarithm problem (ECDLP) is the problem of finding the scalar $k$ given the points $P$ and $Q$ in the group. For a sufficiently large group order, ECDLP becomes hard to solve. Most of the state-of-the-art works related to ECC have focused on the efficient implementation of scalar multiplication [6], which is a condition for an efficient ECC implementation.
The López-Dahab Montgomery point multiplication (PM) algorithm [21], shown in Algorithm 1, has been commonly used for the $kP$ computation because it is resistant to side-channel attacks, suitable for parallelization, and low-resource friendly. In this work, we use the López-Dahab algorithm to implement, for the first time, the most compact FPGA-based hardware architecture for computing $kP$ in binary elliptic curves.
The main operations in Algorithm 1 are addition, multiplication, and squaring in $GF(2^m)$. Consider the fields recommended by NIST for practical ECC, $GF(2^m)$ with $m = 233$ and $m = 409$. For $m = 409$, Algorithm 1 will have a cost of 1227 field additions, 2454 field multiplications, and 2454 field squarings over $GF(2^{409})$, with field multiplication being the most time-consuming operation.
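As a quick sanity check of the operation counts quoted above, the totals can be reproduced from per-iteration counts. This is only a sketch: it assumes one main-loop iteration per bit of a 409-bit scalar, and the per-iteration figures (3 additions, 6 multiplications, 6 squarings) are not stated in the text but inferred by dividing the quoted totals by 409.

```python
# Assumption: one main-loop iteration of Algorithm 1 per scalar bit, with m = 409.
M = 409

# Assumption: per-iteration counts inferred from the quoted totals (total / 409);
# they are not given explicitly in the text.
PER_ITERATION = {"additions": 3, "multiplications": 6, "squarings": 6}


def ladder_cost(m, per_iteration=PER_ITERATION):
    """Total number of field operations for an m-iteration main loop."""
    return {op: count * m for op, count in per_iteration.items()}


totals = ladder_cost(M)
print(totals)
```

Under these assumptions the same function gives the cost for any other field size, which makes it easy to see that the cost grows linearly with $m$.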
López-Dahab's method for scalar multiplication in ECC is considered the most suitable method when targeting devices with low computing power [22]. The elliptic curve point is represented in projective coordinates. At the beginning, the elliptic curve point in affine coordinates $(x, y)$ is converted to its projective representation $(X, Z)$. Algorithm 1 uses only the $x$ coordinate for point representation, so storage resources can be saved (line 5). With this setting, costly field inversions are avoided in each group (curve level) operation. Only one field inversion is required for coordinate conversion from projective to affine at the end of the main loop (line 13). Algorithm 1 runs in constant time and is resistant to some side-channel attacks such as simple power analysis (SPA).
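The x-only Montgomery ladder described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1 or their hardware: the field $GF(2^{233})$ with the trinomial $x^{233} + x^{74} + 1$, the curve coefficient $b = 1$, and all function names are choices made for this example. The sketch returns only the affine $x$ coordinate of $kP$ (recovery of $y$ is omitted), and plain Python integers are not constant-time.

```python
# Assumptions: GF(2^233) with trinomial x^233 + x^74 + 1 and curve coefficient b = 1.
M = 233
F = (1 << 233) | (1 << 74) | 1
B_COEFF = 1


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M) with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # degree reached M: reduce by XORing the modulus
            a ^= F
    return r


def gf_sqr(a):
    return gf_mul(a, a)


def gf_inv(a):
    """Inverse via Fermat's little theorem: a^(2^M - 2)."""
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_sqr(a)
        e >>= 1
    return r


def mdouble(X, Z):
    """Projective x-only doubling: (X^4 + b*Z^4, X^2 * Z^2)."""
    X2, Z2 = gf_sqr(X), gf_sqr(Z)
    return gf_sqr(X2) ^ gf_mul(B_COEFF, gf_sqr(Z2)), gf_mul(X2, Z2)


def madd(x, X1, Z1, X2, Z2):
    """Projective x-only addition of two points whose difference has affine x."""
    t1, t2 = gf_mul(X1, Z2), gf_mul(X2, Z1)
    Z3 = gf_sqr(t1 ^ t2)
    return gf_mul(x, Z3) ^ gf_mul(t1, t2), Z3


def ladder_x(k, x):
    """Affine x coordinate of kP for k >= 1, given the affine x of P (None for infinity)."""
    X1, Z1 = x, 1                    # P
    X2, Z2 = mdouble(x, 1)           # 2P
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            (X1, Z1), (X2, Z2) = madd(x, X1, Z1, X2, Z2), mdouble(X2, Z2)
        else:
            (X2, Z2), (X1, Z1) = madd(x, X1, Z1, X2, Z2), mdouble(X1, Z1)
    # Single field inversion, only at the end of the main loop.
    return None if Z1 == 0 else gf_mul(X1, gf_inv(Z1))
```

A Diffie-Hellman style check is a natural usage example: for two hypothetical secrets `a` and `b` and a base `x`, `ladder_x(a, ladder_x(b, x))` and `ladder_x(b, ladder_x(a, x))` should agree, since both equal the $x$ coordinate of $(ab)P$.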

2.3. Related Work
Since scalar multiplication is the core operation in ECC cryptographic schemes, it has been the main target for hardware acceleration; however, few works have pursued low-area designs compared to those trying to achieve maximum performance. For the devices used in the IoT, generally sensor nodes, lightweight realizations of cryptography are preferred in order to efficiently use the available computing and power resources [23].
The computation of $kP$ requires executing a scalar multiplication algorithm, Algorithm 1 being one of the most recommended. At each iteration, curve (group) arithmetic is executed, either point addition or point doubling, each implying several finite field operations. So, operations in groups and finite fields are critical for public key cryptography, as in ECC. An efficient implementation of $kP$ requires an efficient implementation of finite field operations, with multiplication and inversion being the most time-consuming field operators. Field inversion can be realized through several field multiplications; consequently, the hardware field multiplier has been studied as the main core to compute $kP$.
In the case of $GF(2^m)$, there are three main families of algorithms to compute a field multiplication: full-parallel, bit-serial, and digit-serial [24]. The full-parallel approach is the most costly in terms of area usage but is the fastest, while the bit-serial approach is generally the most compact but the slowest. The digit-serial approach allows a trade-off between computation time and area usage.
Related works are discussed in this section based on the type of multiplier being used (bit-serial, digit-serial), the computing approach (LSE, MSE), the implementation platform (FPGA type), the finite field size, and the implementation results in terms of time and area (FPGA slices). Note that our contribution is in the multiplier being used and in the computing approach (digit-digit). This approach has not been explored, and we present for the first time an FPGA accelerator for ECC based on it.
Digit-serial and bit-serial approaches to field multiplication are iterative algorithms that process one of the operands in the multiplication from left to right (MSE, most significant element first) or from right to left (LSE, least significant element first). At each iteration, the partial results need modular reduction. Bertoni et al. [25] presented an easy way to perform modular reduction when partial results have coefficients with powers greater than $m - 1$. Beuchat et al. [24] surveyed some of the most representative implementations using MSE and LSE algorithms (including implementations presented in [25]).
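To make the iterative scheme concrete, the following minimal sketch shows a bit-serial MSE multiplication with a reduction step at every iteration. It is an illustration only: the field $GF(2^{233})$ with the trinomial $x^{233} + x^{74} + 1$ and the function name are assumptions for this example, and it does not reproduce any of the surveyed implementations.

```python
# Assumption: GF(2^233) with the trinomial x^233 + x^74 + 1 (illustrative choice).
M = 233
F = (1 << 233) | (1 << 74) | 1


def mul_bitserial_mse(a, b):
    """Multiply a*b in GF(2^M), scanning b from its most significant bit."""
    c = 0
    for i in range(M - 1, -1, -1):
        c <<= 1                 # partial result times x
        if c >> M:              # degree reached M: modular reduction step
            c ^= F
        if (b >> i) & 1:        # add (XOR) operand a when the current bit of b is 1
            c ^= a
    return c
```

One operand (`b`) is consumed one bit per iteration while the other (`a`) is used in full at every step, which is why the hardware cost of this family still grows with the field size $m$.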
Digit-serial implementations (with digit size $d$) require $\lceil m/d \rceil$ iterations [26]. However, in [27], it is proposed to use higher-degree partial results to improve computation performance at the cost of one extra iteration, requiring $\lceil m/d \rceil + 1$ iterations to compute a multiplication over $GF(2^m)$. The digit-serial algorithm proposed in [25] likewise keeps higher-degree partial results to improve computation performance. Beuchat [24] concluded that the MSE-first approach requires less hardware and offers higher throughput than LSE. In [28], the reduction steps are performed separately. It is stated that for a finite field generated by the irreducible polynomials recommended by NIST [29], reduction can be performed by a set of XOR operations [30, 31]. In [28], only the multiplication step is considered, implemented with a digit-serial approach. A digit size of 16 bits is proposed since, in most cases, 16-bit words give better results.
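The digit-serial idea can be sketched as a Horner-style loop over the digits of one operand. This is a hedged illustration, not the algorithm of any cited work: the field $GF(2^{233})$, the trinomial $x^{233} + x^{74} + 1$, the default digit size of 16, and all names are assumptions. The partial result is fully reduced after every iteration, which is only one of the variants discussed above.

```python
# Assumption: GF(2^233) with trinomial x^233 + x^74 + 1, i.e. x^233 = x^74 + 1.
M, A = 233, 74
MASK = (1 << M) - 1


def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2), no reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_mod_f(c):
    """Reduce c modulo the trinomial by folding the high part: x^M -> x^A + 1."""
    while c >> M:
        hi = c >> M
        c = (c & MASK) ^ hi ^ (hi << A)
    return c


def iterations(d):
    """Number of digit-serial iterations for digit size d: ceil(M / d)."""
    return -(-M // d)


def mul_digitserial_mse(a, b, d=16):
    """Multiply a*b in GF(2^M), consuming b one d-bit digit at a time (MSE first)."""
    c = 0
    for j in range(iterations(d) - 1, -1, -1):
        digit = (b >> (j * d)) & ((1 << d) - 1)
        # Horner step: shift the partial result by one digit and add a * digit.
        c = reduce_mod_f((c << d) ^ clmul(a, digit))
    return c
```

With `d = 1` the loop degenerates to the bit-serial case (233 iterations), while `d = 16` needs 15 iterations, which illustrates the area/time trade-off controlled by the digit size.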
In [32], an LSE digit-serial multiplier is used; however, a digit size of one bit (bit-serial) resulted in the most compact version. In [33], a systolic hardware architecture is proposed to compute multiplication and inversion in the same hardware. Furthermore, an arithmetic unit is constructed that can perform all arithmetic operations required in elliptic curve cryptography. In [34], a digit-digit multiplier on an MSE basis is presented for the first time. Operands, modulus, and partial results are partitioned into digits and processed one digit at a time. The main advantage compared to digit-serial or bit-serial implementations is that operands and partial results can be stored in BRAMs instead of shift registers, which saves standard logic (slices). However, the multiplier presented is designed and evaluated as a standalone module, which is hard to use directly in a $kP$ engine.
Table 1 summarizes the most relevant works for $GF(2^m)$ multiplication in FPGAs, the main algorithms used, and the area/time results. Table 2 shows some of the most representative hardware designs for $kP$ computation. Most of the reported works use the bit-serial or digit-serial approach to implement hardware $GF(2^m)$ operators. However, the hardware resources required by these approaches depend directly on the operand size (field size $m$), because even when one of the operands is processed iteratively, the other one is processed in parallel.


The bit-serial approach requires a small amount of hardware resources compared to the digit-serial or full-parallel approach, but for large operands, even the bit-serial approach requires a considerable amount of hardware resources (slices). However, some recent works have already proposed using a digit-digit approach, for example, [34, 35]. The main drawback of the multiplier presented in [34] is the use of shift registers to store partial results and the infeasibility of using such a design in a practical $kP$ engine, while the drawback of [35] is the need to fit the digit sizes to the FPGA's embedded DSP multipliers.
In order to reduce area requirements and achieve a compact design well suited for IoT applications, the approach in this work to construct a $kP$ hardware accelerator follows the digit-digit computation approach and makes use of the multipliers and memory blocks embedded in most FPGAs to save FPGA standard logic. By implementing a strategy for reusing memory blocks, critical for the iterative processing of the digit-digit approach, considerable area resources are saved while retaining the advantage of iteratively processing both operands in the multiplication and not only one, as in the digit-serial or bit-serial approaches. Additionally, since memory blocks are bigger than the operands, it is proposed to use part of the available memory blocks to store control signals (microprogramming), thus avoiding the logic needed to implement a state machine for control.
2.4. Novel Digit-by-Digit Elliptic Curve Point Multiplication Hardware Architecture
The proposed ECC engine, suitable for FPGA-based sensor nodes in the IoT, is constructed following a layered approach. The low layer is the $GF(2^m)$ arithmetic, where field multiplication is the main operation to be optimized in terms of area resources. The high layer, which uses the multiplier as a building block, is the curve arithmetic, consisting of an area-optimized realization of Algorithm 1, where the multiplier is used to compute each of the point additions (lines 8 and 10). At this level, the multiplier is also used to realize the field inversion and field squaring required in the point addition and point doubling operations. In both layers, the proposed design methodology takes advantage of the block RAMs (BRAMs) embedded in modern FPGAs to store the operands and the partial and final results, reusing the BRAMs as much as possible through careful field operation scheduling and a memory management strategy.
2.4.1. Field Arithmetic
Arithmetic in $GF(2^m)$ is done using polynomial basis. Under this representation, each element in the field is an $(m-1)$-degree polynomial over the field $GF(2)$. The two binary operators are addition and multiplication with reduction modulo $F(x)$, which is an irreducible polynomial of degree $m$. Field addition is the bitwise XOR of coefficients (carry-free, no reduction needed), a cheap operation when implemented in hardware. The additive inverse in $GF(2^m)$ under polynomial basis is also easy to implement, since for any $a$ in $GF(2^m)$, $a + a = 0$, with 0 as the neutral addition element (all-zero polynomial).
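A tiny example may help to visualize field addition in polynomial basis. The sketch below encodes a polynomial as an integer whose bit $i$ is the coefficient of $x^i$; the encoding and the function name are choices made for this illustration.

```python
def gf_add(a, b):
    """Addition in GF(2^m), polynomial basis: coefficient-wise XOR, no carries."""
    return a ^ b


a = 0b1011          # x^3 + x + 1
b = 0b0110          # x^2 + x
total = gf_add(a, b)
print(bin(total))   # the x terms cancel, leaving x^3 + x^2 + 1

# Every element is its own additive inverse: a + a = 0.
assert gf_add(a, a) == 0
```

Because the result never has a higher degree than the operands, no modular reduction is needed after an addition, and subtraction is the same operation as addition.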
Multiplication and multiplicative inverses (or simply inversion) in $GF(2^m)$ are more complex operations. Since Algorithm 1 only requires one inversion at the end of the $kP$ computation, field inversion is implemented using the Itoh-Tsujii algorithm, by a series of multiplications. So, the field multiplier becomes the most critical operation to be carefully implemented in ECC hardware approaches and one of the critical components in our $kP$ engine.
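As an illustration of inversion by a series of multiplications, the following sketch follows the usual Itoh-Tsujii addition chain on $m - 1$, using the identity $a^{-1} = (a^{2^{m-1}-1})^2$. It is a software sketch under assumptions, not the authors' implementation: the field $GF(2^{233})$, the trinomial $x^{233} + x^{74} + 1$, and the function names are chosen for this example, and squarings are done with the multiplier.

```python
# Assumption: GF(2^233) with the trinomial x^233 + x^74 + 1 (illustrative choice).
M = 233
F = (1 << 233) | (1 << 74) | 1


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M) with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F
    return r


def itoh_tsujii_inverse(a):
    """Inverse of a nonzero element, computed as (a^(2^(M-1) - 1))^2."""
    beta, k = a, 1                       # invariant: beta = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:           # bits of M-1 after the leading one
        t = beta
        for _ in range(k):               # t = beta^(2^k), i.e. k squarings
            t = gf_mul(t, t)
        beta, k = gf_mul(t, beta), 2 * k         # beta = a^(2^(2k) - 1)
        if bit == "1":
            beta, k = gf_mul(gf_mul(beta, beta), a), k + 1
    return gf_mul(beta, beta)            # a^(2^M - 2) = a^(-1)


def multiplication_count(m):
    """General multiplications in the chain: floor(log2(m-1)) + HW(m-1) - 1."""
    return (m - 1).bit_length() - 1 + bin(m - 1).count("1") - 1
```

For $m = 233$ the chain needs only a handful of general multiplications (the count is given by `multiplication_count(233)`); the remaining work is squarings, which in this sketch also go through the multiplier.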
2.4.2. Multiplication
In the literature, there are basically three approaches for computing field multiplication in hardware: bit-serial (the most compact design), digit-serial (for area-performance trade-offs), and full-parallel (the fastest but also the costliest solution in terms of area). The most significant element (MSE) first and least significant element (LSE) first algorithms (bit-serial or digit-serial) are commonly used to compute multiplications over $GF(2^m)$.
In this work, we propose a novel digit-digit multiplier algorithm well suited to be integrated into a $kP$ engine. The digit-digit computing approach aims at performing better than a bit-serial multiplier, keeps the property of allowing area-performance trade-offs to be explored when realized in hardware, and is not as expensive as a full-parallel realization. This is consistent with our design methodology to achieve a compact architecture (simpler datapath) for the $kP$ engine. Details of the digit-digit multiplier are presented in Section 2.4.3.
$GF(2^m)$ multiplication using the digit-digit computing approach was previously suggested in [34]. However, the multiplier design in that work is not suitable for direct application in a $kP$ engine. The authors of that work only demonstrated the advantages of the digit-digit approach over the well-known bit-serial and digit-serial multipliers, as a standalone module. When that multiplier is considered for realizing the $kP$ operation, several issues must be solved.
Since the multiplier is part of the series of operations implied by each point addition in the main loop of the $kP$ computation, the main challenge for the digit-digit multiplier is that the partial results at each iteration and the final result (possibly combined with other values) are input operands for the same multiplier in later iterations. During the digit-digit computation, the multiplier must keep its operands in two memory blocks and progressively store the partial results in a third one. At the end, the results in the third memory should be moved to one of the operand memories for further processing (a $kP$ operation requires several multiplications), introducing a delay in the computation unless that data movement is done during the computation. So, an operand memory must act as an input and output memory at the same time. Since a complete $kP$ operation requires several hundreds of multiplications, using the multiplier as proposed in [34] without addressing this data memory management issue is impractical.
As explained in the next section, the main issue in integrating a digit-digit multiplier into the $kP$ engine is implementing an efficient data memory management that ensures the correct execution of both the digit-digit field multiplier and the scalar multiplication algorithm. In this work, we present the design of a novel digit-digit multiplier that achieves compact designs by optimizing the resources for finite fields defined by trinomials.
2.4.3. Digit-Digit Multiplier
Starting from the definition of elements in $GF(2^m)$ as polynomials of the form $A(x) = \sum_{i=0}^{m-1} a_i x^i$ with binary coefficients, in this section we present how the mathematical expression that computes a $GF(2^m)$ multiplication in a digit-by-digit fashion is derived (from Eq. (2) to Eq. (9)). This expression leads to the specification of the multiplier that is the building block of our FPGA-based engine for scalar multiplication in ECC.
An element of the form can be represented as the sum of polynomials (digits) each of coefficients in (Eq. (2)).
So, Eq. (3) expresses the multiplication in a digitserial approach.
Let , , and the ()degree polynomial resulting from the partial product at iteration in Eq. (3). By parsing elements of from lefttoright (MSE), computation at iteration is determined by recurrence in Eq. (4):
where polynomial has the most degree (), while is of degree . After iterations, the polynomial of degree () needs reduction. By introducing an extra iteration with and , is the result. The term in this last expression can be easily reduced modulo by only discarding the digit .
Being an degree polynomial, . So, , a polynomial of degree with . Thus, elements with can be reduced using equivalence .
Degree of from Eq. (5) (after reduction) is at most . This polynomial becomes the polynomial to be reduced in the next iteration . So, at each iteration , it is required to reduce the terms of , . By using the previous assumption for polynomial reduction being a trinomial, the reduction in Eq. (5) can be defined as in Eq. (6).
This way, is partitioned in two polynomials and of degree and , respectively. The partial multiplication will not require modular reduction if . So, Eq. (5) can be rewritten as in Eq. (7).
Under the digitdigit computation approach, the polynomial , , and is represented in digits. Since the degree is , the computation can be achieved iteratively, taken digit and iterating through digits. Taking as a constant, . With this new notation, the first term in Eq. (4) can be rewritten as in Eq. (7).
Once and are expressed to be processed in an iterative way one digit at a time, Eq. (7) can be rewritten in a notation that leads to an iterative, digitbydigit computation of each partial product of multiplication, given by Eq. (9).
At each iteration, values and can be computed in a parallel way. For the sake of clarity about the computations in Eq. (9), the sum of digits and can be expressed as a single variable . This new variable is () bits in size as shown in Figure 2.
With all these considerations, the proposed algorithm for computing multiplication over $GF(2^m)$ is presented in Algorithm 2.
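The general idea of processing both operands one digit at a time can be sketched in software as follows. This is explicitly not Algorithm 2: it is a simplified illustration under assumptions (field $GF(2^{233})$ with the trinomial $x^{233} + x^{74} + 1$, digit size 16, hypothetical names `a_mem` and `b_mem` standing in for memory blocks), and the accumulator is kept as a plain integer, whereas a hardware design would also store partial results digit by digit.

```python
from math import ceil

# Assumption: GF(2^233) with trinomial x^233 + x^74 + 1, i.e. x^233 = x^74 + 1.
M, A = 233, 74
MASK = (1 << M) - 1


def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2), no reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_mod_f(c):
    """Reduce modulo the trinomial by folding the high part: x^M -> x^A + 1."""
    while c >> M:
        hi = c >> M
        c = (c & MASK) ^ hi ^ (hi << A)
    return c


def to_digits(v, d, s):
    """Split v into s digits of d bits, least significant digit first."""
    return [(v >> (i * d)) & ((1 << d) - 1) for i in range(s)]


def mul_digit_digit(a, b, d=16):
    """Multiply a*b in GF(2^M) using only d-bit by d-bit partial products."""
    s = ceil(M / d)
    a_mem = to_digits(a, d, s)       # hypothetical stand-in for a memory block
    b_mem = to_digits(b, d, s)       # hypothetical stand-in for a memory block
    c = 0
    for j in reversed(range(s)):     # digits of b, most significant first (MSE)
        c = reduce_mod_f(c << d)     # shift the accumulated result by one digit
        for i in range(s):           # digits of a, one at a time
            partial = clmul(a_mem[i], b_mem[j])          # at most 2d - 1 bits
            c ^= reduce_mod_f(partial << (i * d))
    return c
```

In this sketch the only multiplier ever needed is a small $d \times d$ partial multiplier, at the price of $s^2$ partial products with $s = \lceil m/d \rceil$; this is the area/time trade-off that the digit-digit approach exposes.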
2.4.4. Digit-Digit Multiplier Hardware Architecture
To achieve compactness, in this work we propose the hardware realization of Algorithm 2 in its simplest form. The hardware architecture requires only one partial multiplier and is optimized for binary fields defined by a trinomial. NIST and other compliant standards have recommended trinomials for binary fields, for example, $x^{233} + x^{74} + 1$ and $x^{409} + x^{87} + 1$.
If the degree trinomial is used, is used for the reduction step. So, if (digit size) is used, when a digit of g(x) () is read, only the two first digits will have a value of 1, when digit will be always 0. In this case, the partial multiplier that computes always computes a multiplication of the form or which can be implemented only with an “and” gate. In conclusion, when a trinomial of the form is used, it is possible to define the digit size . In this case, the partial multiplier that computes can be implemented using only a multiplexer as it is shown in Figure 3.
2.4.5. Curve Arithmetic
The hardware for elliptic curve scalar multiplication is guided by the execution of Algorithm 1, which is based on iterative calls to the point addition functions Madd and Mdouble.

Figure 4 shows the required operations at each iteration of Algorithm 1 and the underlying operations (denoted by circles). After each operation, the figure also shows the memory where the intermediate values are stored. For example, the memory stores the first field operation in the point addition operation. While five multiplications are needed to compute a single Madd operation, six multiplications are required for Mdouble.
The schedule of field operations shown in Figure 4 uses only four memories to compute the complete Madd function, by reusing the memory blocks properly. For Mdouble, four memories are also enough. The memories are used alternately, as shown in the figure, to act as the repository for the input parameters of a field multiplier/adder or as the repository for the multiplication/addition result. We stress again that proper data memory management must be implemented to avoid the delays induced by moving data from the result memory to the input parameter memory in the chained operations.
Since Algorithm 1 uses only the $X$ and $Z$ coordinates of elliptic curve points in projective representation, each point is stored in two BRAMs, one for the $X$ coordinate and the other for the $Z$ coordinate. Figure 4 shows the variables that represent the memories for the two points.
For Madd, let us consider the first multiplication stored in and the second multiplication stored in . Both multiplications can be done in parallel, with memories acting as reading memories and and acting as the writing memories. For the third multiplication , memories and must switch to act as reading memories, and the result can be stored in , the memory that initially stored one of the input parameters and now acts as a writing memory. As the multiplier delivers a result at each stage in point addition, at the same time, it processes the input digits. So, a careful management of the memory is required to avoid latency for data movement for result and input parameter memories. This requirement arises because the result of the field multiplier in an earlier stage becomes the input parameter of later stages.
In the rest of the point addition computation, memories alternate their functionality following the switching strategy of read/write memories. At the end, the final result must be in a memory, that is used in the next iteration at line 6 in Algorithm 1, so that values will reside in one of the four available memories, and input parameters in next iteration in the main loop of Algorithm 1 are adjusted. Memories associated to points and are overwritten with new partial results coming from the Madd and Mdouble functions.
At line 8 (or line 10) in the main loop of Algorithm 1, the memories storing the coordinates of the current points are read memories, and the result is finally stored in the result memories (see Figure 4). In the next iteration, those results become the input parameters, and the memories that previously held the inputs become the storage for the result of the final point addition. So, at the curve-level algorithm, the memories also interchange their functionality and are properly mapped to the memories for the final results in Figure 4. An extra BRAM is required to store the scalar.
The building blocks to compute the operations described in Figure 4 are those for field arithmetic: addition, multiplication, squaring, and inversion over GF(2^m). Squaring is usually considered cheaper than multiplication. However, since in this work operands are stored in BRAMs and are read and written one digit at a time, it is difficult to take advantage of optimized algorithms such as the fast reduction algorithm proposed by NIST, which is commonly used in squaring. So, to save hardware resources, this work uses the multiplication core to compute square operations. Reusing the multiplier saves area but increases latency. Also, inversion is computed with the Itoh-Tsujii algorithm by means of multiplications, squarings, and additions in GF(2^m).
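The field operations listed above can be modeled in software as follows. This is a minimal sketch, not the hardware datapath: it works on whole Python integers rather than on digits read from BRAMs, it assumes the sect233 reduction trinomial x^233 + x^74 + 1, and the function names are illustrative. Squaring reuses the multiplier, and inversion follows the Itoh-Tsujii idea of building a^(2^k - 1) with an addition chain on the exponent k.

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1 (assumed sect233 trinomial)

def gf_add(a, b):
    """Addition in GF(2^m) is a bitwise XOR of the coefficient vectors."""
    return a ^ b

def gf_mul(a, b):
    """Shift-and-add multiplication with interleaved modular reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:              # degree reached m: reduce with the trinomial
            a ^= POLY
    return r

def gf_sqr(a):
    """Squaring computed by reusing the multiplier (saves area, costs latency)."""
    return gf_mul(a, a)

def gf_inv(a):
    """Itoh-Tsujii inversion: a^-1 = (a^(2^(m-1) - 1))^2 for a != 0."""
    beta, k = a, 1                      # invariant: beta = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:          # remaining bits of m-1, MSB first
        t = beta
        for _ in range(k):              # t = beta^(2^k)
            t = gf_sqr(t)
        beta, k = gf_mul(t, beta), 2 * k
        if bit == "1":
            beta, k = gf_mul(gf_sqr(beta), a), k + 1
    return gf_sqr(beta)
```

For a nonzero element `a`, `gf_mul(a, gf_inv(a))` should return 1. With m = 233 the addition chain on m - 1 = 232 needs only ten multiplications, while the squarings dominate the cost when they are executed on the multiplier.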
At each iteration of Algorithm 1, the Madd and Mdouble operations can be computed in parallel since there is no data dependency between them. In this work, we propose to use one multiplier in Madd and another in Mdouble to take advantage of this parallelism. In the dataflow for each point addition, the multiplier is reused. In addition to the multipliers, one adder is also required. The same adder can be used in both the Madd and Mdouble operations since each operation needs it at a different time.
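The independence of Madd and Mdouble within one iteration can be seen in a short software model of the ladder. The sketch below is an assumption-laden illustration rather than the paper's Algorithm 1: it uses the standard x-only López-Dahab formulas from the literature, the sect233 trinomial, and a curve coefficient b = 1 (as on the Koblitz curve sect233k1); all function names are hypothetical.

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1
B = 1                               # assumed curve coefficient b

def mul(a, b):
    """Bitwise shift-and-add multiplication in GF(2^233)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def sqr(a):
    return mul(a, a)

def inv(a):
    """a^(2^m - 2), built as the product of a^(2^i) for i = 1..m-1."""
    r = 1
    for _ in range(M - 1):
        a = sqr(a)
        r = mul(r, a)
    return r

def madd(X1, Z1, X2, Z2, x):
    """x-only addition of two points whose difference has affine x-coordinate x."""
    t1, t2 = mul(X1, Z2), mul(X2, Z1)     # independent multiplications
    Z3 = sqr(t1 ^ t2)
    X3 = mul(x, Z3) ^ mul(t1, t2)
    return X3, Z3

def mdouble(X, Z):
    """x-only doubling: X' = X^4 + b*Z^4, Z' = X^2 * Z^2."""
    X2, Z2 = sqr(X), sqr(Z)
    return sqr(X2) ^ mul(B, sqr(Z2)), mul(X2, Z2)

def ladder(k, x):
    """Affine x-coordinate of k*P (k >= 1) for a point P with x-coordinate x."""
    P1, P2 = (x, 1), mdouble(x, 1)        # P and 2P
    for bit in bin(k)[3:]:                # scalar bits after the leading 1
        s = madd(*P1, *P2, x)             # reads the old P1 and P2
        if bit == "1":
            P1, P2 = s, mdouble(*P2)      # doubling also reads an old value
        else:
            P1, P2 = mdouble(*P1), s
    X, Z = P1
    return mul(X, inv(Z))
```

In every iteration both `madd` and `mdouble` read only values produced by the previous iteration, which is why one multiplier can be assigned to each of them. As a sanity check, `ladder(2, x)` equals the affine doubling formula x^2 + b/x^2.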
Although more multipliers could be added to speed up the computation, that approach resulted in extra hardware cost, not only because of the area required by each multiplier but also because of the increased complexity of the control module and the additional multiplexers needed to manage input/output operands to the cores.
The entire dataflow is managed by a control unit that drives the memory blocks for word-based reading and writing and also commands the cores (multipliers and adder). The control module waits until each partial multiplication/addition has finished and then starts the next required operations with the correct BRAMs as input sources.
3. Results and Discussion
The proposed compact hardware ECC design was implemented over the binary fields GF(2^233) and GF(2^409), both defined by an irreducible trinomial. The elliptic curves used were sect233 and sect409, both recommended by NIST and other recognized organizations such as SECG. The target platform was the IoT-recommended FPGA board MicroZed, with Xilinx Vivado HLx 2016.4 as the development tool.
The hardware architecture for scalar multiplication in GF(2^m) was evaluated in a hardware-software codesign of the elliptic curve version of the Diffie-Hellman key exchange (ECDH). Consider two FPGA-based sensor nodes [36] that agree on an elliptic curve group with a given generator point and order. Each party selects a secret integer and, using a scalar multiplication engine, computes its public value as the scalar multiple of the generator by its secret integer.
Each sensor then multiplies the other sensor's public value by its own secret integer. Since both sensors obtain the same point, this point acts as a shared secret key between them, so a secure channel can be established to transport data between the two devices in encrypted form (for example, using a lightweight block cipher). In addition, authentication tags can be generated by using the shared secret to authenticate a message, for example, with LightMAC. The main complexity in ECDH (as in other ECC-based cryptographic schemes) is the computation of the scalar multiplication.
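The exchange described above can be written compactly. The symbols are assumptions made for illustration, since the original notation is not reproduced here: sensors A and B, generator G, secret integers d_A and d_B, public values Q_A and Q_B, and shared point K.

```latex
\begin{aligned}
Q_A &= d_A\,G, \qquad Q_B = d_B\,G,\\
K   &= d_A\,Q_B = d_A\,(d_B\,G) = d_B\,(d_A\,G) = d_B\,Q_A .
\end{aligned}
```

Each sensor therefore performs two scalar multiplications per key exchange: one to produce its public value and one to derive the shared point.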
3.1. Hardware/Software Codesign
Figure 5 shows the proposed hardware-software codesign for the scalar multiplier over GF(2^m), suitable to be realized in an FPGA sensor node. The codesign was realized in the MicroZed board, and the implementation results are shown in Table 3. This is a representative final application under an IoT scenario (IIoT, MIoT) where sensor nodes are deployed using SoC technology: the scalar multiplication is executed in FPGA technology coupled to a master general-purpose processor that runs the rest of the application logic. The hardware-software codesign required 1809 slices of the FPGA embedded in the MicroZed board, running at 62.5 MHz.

Table 3 also compares the time to perform a scalar multiplication under the hardware/software codesign versus a pure software implementation. This is done to highlight the performance gain of a hardware approach for the most time-consuming operation in ECC, as in ECDH. For this, we used the MIRACL library for the software implementation of scalar multiplication on the Cortex A9 of the Zynq, also available in the MicroZed board. In this case, we used the same implementation parameters: curve, finite field, size of the finite field, irreducible polynomial, projective coordinates, and the same Algorithm 1 for scalar multiplication.
The hardware-accelerated execution of scalar multiplication requires 4.13 ms to compute an elliptic curve Diffie-Hellman key exchange, versus the 70 ms required by the pure software implementation with the MIRACL library in the MicroZed. Thus, our codesign is about 17 times faster than the pure software implementation while requiring only 36% of the FPGA slices in the MicroZed, leaving the remainder of the FPGA's standard logic available for other application requirements in the sensor node. These results show that our design retains the advantages of a hardware implementation by improving performance while using fewer area resources.
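The reported speedup follows directly from the two measured times; the subscripted names are labels introduced here for readability.

```latex
\text{speedup} = \frac{t_{\text{software}}}{t_{\text{codesign}}}
              = \frac{70\ \text{ms}}{4.13\ \text{ms}} \approx 16.9
```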
3.2. Comparison with Other Similar FPGA Designs
Table 4 shows a comparison with state-of-the-art works on FPGA scalar multipliers in GF(2^m). In this comparison, we use the same elliptic curves, finite fields and sizes, and the same irreducible polynomial. A fair comparison is very difficult to achieve because different FPGA technologies and implementation strategies are used. It is not possible to compare all the works under the same criteria, since some hardware designs exploit embedded blocks such as DSPs or block RAMs (BRAMs) while others take advantage of the available slices/LUTs. However, this research is focused on lightweight implementations with the goal of using few standard logic resources. So, the embedded memory blocks in the FPGAs are exploited to reduce the use of standard reconfigurable logic (slices). The comparison in Table 4 is mainly in terms of the FPGA standard logic (slices) reported. Although efficiency and throughput are not the main aims of this research, they are used as reference metrics.
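The efficiency metric can be reproduced from the figures reported in the abstract for sect233 (128820 clock cycles at 135.31 MHz and 1170 slices). The sketch below assumes that throughput is defined as m bits per scalar multiplication latency and that efficiency is throughput divided by slices; this definition is inferred because it matches the reported 0.209 kbps/slice, and the function name is illustrative.

```python
def efficiency(m_bits, cycles, f_mhz, slices):
    """Return (latency in s, throughput in kbps, efficiency in kbps/slice)."""
    latency_s = cycles / (f_mhz * 1e6)           # time for one scalar multiplication
    throughput_kbps = m_bits / latency_s / 1e3   # field bits processed per second
    return latency_s, throughput_kbps, throughput_kbps / slices

latency_s, throughput_kbps, eff = efficiency(233, 128820, 135.31, 1170)
print(f"latency    = {latency_s * 1e3:.3f} ms")
print(f"throughput = {throughput_kbps:.1f} kbps")
print(f"efficiency = {eff:.3f} kbps/slice")
```

Under these assumptions one scalar multiplication takes just under 1 ms, which is consistent with an ECDH exchange in the codesign taking a few milliseconds at the lower 62.5 MHz clock.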

The design presented in [32] follows a digit-serial approach for multiplication and inversion over GF(2^m), while squaring and addition over GF(2^m) are computed fully with standard logic in only one clock cycle. Compared to our design, those results are almost ten times better in terms of efficiency. However, our design uses considerably fewer area resources: for digit sizes of 8, 16, and 32, the required area is 442, 626, and 1170 slices, respectively. In [37], a hardware architecture for elliptic curve scalar multiplication over GF(2^m) is presented, implemented for the NIST-recommended binary fields GF(2^233) and GF(2^283). That scalar multiplier hardware architecture requires 3016 and 4625 slices for operand sizes 233 and 283, respectively. Compared to our engine for GF(2^233), that design requires 6.8 times more slices (3016 versus 442 at a digit size of 8), and the efficiency (Mbps/slice) of the two designs differs by a factor of 2.2. The scalar multiplier over GF(2^m) presented in [38] is better in efficiency than ours, but at a considerably higher cost in terms of area usage.
Table 4 shows that most of the works achieve better throughput/efficiency than our proposed hardware design. However, the main aim of this work is to save hardware resources (slices), and this is achieved by sacrificing throughput. The obtained results show that, despite the sacrificed throughput, the proposed design achieves significantly better performance than software counterparts while using fewer resources than similar FPGA designs. The reduction in area resources is a direct result of using a digit-by-digit computing approach in the layered structure of the engine, mainly determined by the multiplier and the strategy for reusing memory blocks during the iterative processing of operands.
In Figure 6, we show graphically how our design uses considerably fewer standard logic resources from the FPGA, thus leaving more logic for other tasks in the upper application layers. In that figure, FPGA resource usage is compared against the works that use FPGA implementation technology, a digit-serial approach, and comparable security levels. Note from this figure that our design is scalable in terms of area because a greater security level only impacts latency. This property is only kept with the digit-digit computing approach.
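The scaling argument can be illustrated with a software model of digit-wise multiplication. This is a simplified digit-serial sketch in which only one operand is consumed in d-bit digits; it is not the digit-digit multiplier of this work, which also processes the other operand digit by digit. The trinomials are the standard ones for 233-bit and 409-bit NIST binary fields, and all names are hypothetical.

```python
FIELDS = {233: 74, 409: 87}   # m -> k for the trinomial x^m + x^k + 1

def reduce_trinomial(r, m):
    """Reduce a polynomial modulo x^m + x^k + 1 using x^m = x^k + 1."""
    k, mask = FIELDS[m], (1 << m) - 1
    while r >> m:
        top = r >> m
        r = (r & mask) ^ top ^ (top << k)
    return r

def mul_reference(a, b, m):
    """Plain bitwise multiplication followed by one reduction."""
    r, i = 0, 0
    while b >> i:
        if (b >> i) & 1:
            r ^= a << i
        i += 1
    return reduce_trinomial(r, m)

def mul_digit_serial(a, b, m, d):
    """Multiply by consuming b in d-bit digits, most significant digit first.

    Returns (product, iterations); the iteration count models latency.
    """
    n = -(-m // d)                    # ceil(m / d) digits
    mask = (1 << d) - 1
    acc = 0
    for i in reversed(range(n)):
        digit = (b >> (i * d)) & mask
        partial = 0
        for j in range(d):            # d-bit digit times the full operand
            if (digit >> j) & 1:
                partial ^= a << j
        acc = reduce_trinomial((acc << d) ^ partial, m)
    return acc, n
```

In this model the work per iteration depends only on the digit size d, while moving from m = 233 to m = 409 raises the iteration count (for d = 32, from 8 to 13). That is the sense in which a larger field costs latency rather than logic.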
4. Conclusion
We have detailed the design and evaluation of a compact FPGA-based ECC hardware design, well suited for Internet of Things applications, specifically for the Industrial Internet of Things (IIoT) or the Internet of Medical Things (MIoT), where sensor nodes can be realized with FPGA technology. The key contributions include a novel digit-digit algorithm for multiplication over GF(2^m), optimized for fields defined by trinomials, and its corresponding compact hardware architecture. This multiplier is the main core for constructing a compact hardware design that computes scalar multiplications in binary elliptic curves over GF(2^m) generated by trinomials, such as the ones recommended by NIST for practical use. We proposed a novel rescheduling of operations in the López-Dahab Montgomery algorithm for elliptic curve scalar multiplication that can be computed with only two multipliers and one adder in a digit-digit fashion, thus reducing the area requirements of the hardware design. For correctness, we validated our design with a hardware-software codesign in the IoT MicroZed Xilinx FPGA board, by executing an instance of the Diffie-Hellman key exchange protocol (ECDH), a common and crucial operation in secure IoT sensor node networks. To our knowledge, the proposed hardware ECC architecture requires fewer standard hardware resources (slices) in FPGAs than other works reported to date while taking advantage of memory blocks already available in modern FPGAs. Furthermore, despite being a compact hardware architecture, it was demonstrated that a considerable acceleration of a representative curve-based cryptographic protocol is obtained compared to a pure software implementation.
Using the proposed ECC accelerator, further work is planned to evaluate the security service costs when implementing ECC-based cryptographic protocols such as digital envelopes and digital signatures in real application scenarios of IoT, IIoT, and MIoT.
Data Availability
Raw data were generated at INAOE Computer Science Department and at Cinvestav Tamaulipas. Derived data supporting the findings of this study are available from the corresponding author MMS on request.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgments
This research was supported by the Fondo Sectorial de Investigación para la Educación, Ciencia Básica SEP-CONACyT, project number 281565. Also, the research was partially funded by project PN20175814, Conacyt Problemas Nacionales.
References
[1] A. Mosenia and N. K. Jha, “A comprehensive study of security of internet-of-things,” IEEE Transactions on Emerging Topics in Computing, vol. 5, no. 4, pp. 586–602, 2017.
[2] L. Z. Cai and M. F. Zuhairi, “Security challenges for open embedded systems,” in Engineering Technology and Technopreneurship (ICE2T), 2017 International Conference on, pp. 1–6, Kuala Lumpur, Malaysia, 2017.
[3] D. Schinianakis, “Alternative security options in the 5g and iot era,” IEEE Circuits and Systems Magazine, vol. 17, no. 4, pp. 6–28, 2017.
[4] T. Eisenbarth, S. Kumar, C. Paar, A. Poschmann, and L. Uhsadel, “A survey of lightweight-cryptography implementations,” IEEE Design & Test of Computers, vol. 24, no. 6, pp. 522–533, 2007.
[5] C. Manifavas, G. Hatzivasilis, K. Fysarakis, and K. Rantos, “Lightweight cryptography for embedded systems – a comparative analysis,” in Data Privacy Management and Autonomous Spontaneous Security, pp. 333–349, Springer.
[6] C. A. Lara-Nino, A. Diaz-Perez, and M. Morales-Sandoval, “Elliptic curve lightweight cryptography: a survey,” IEEE Access, vol. 6, pp. 72514–72550, 2018.
[7] P. Yalla and J. P. Kaps, “Lightweight cryptography for fpgas,” in 2009 International Conference on Reconfigurable Computing and FPGAs, pp. 225–230, Quintana Roo, Mexico, 2009.
[8] A. Diaz-Perez, M. Morales-Sandoval, and C. Lara-Nino, “Use of FPGAs for enabling security and privacy in the IoT: features and case studies,” in FPGA Algorithms and Applications for the Internet of Things, chapter 2, P. Sharma and R. Nair, Eds., pp. 26–45, IGI Global, 2020.
[9] G. Xu, Z. Chen, and P. Schaumont, “Energy and performance evaluation of an fpga-based soc platform with aes and present coprocessors,” in Embedded Computer Systems: Architectures, Modeling, and Simulation, M. Berekovic, N. Dimopoulos, and S. Wong, Eds., pp. 106–115, Springer, Berlin, Heidelberg, 2008.
[10] H. Abdelkrim, S. Ben Othman, and S. Ben Saoud, “Reconfigurable soc fpga based: Overview and trends,” in 2017 International Conference on Advanced Systems and Electric Technologies, pp. 378–383, Hammamet, Tunisia, 2017.
[11] X. Zhang, A. Ramachandran, C. Zhuge et al., “Machine learning on fpgas to face the iot revolution,” in 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 894–901, Irvine, CA, USA, 2017.
[12] C. Hao, X. Zhang, Y. Li et al., “Fpga/dnn codesign: an efficient design methodology for iot intelligence on the edge,” in Proceedings of the 56th Annual Design Automation Conference 2019, DAC ‘19, New York, NY, USA, 2019.
[13] S. Wang, Y. Hou, F. Gao, and X. Ji, “A novel iot access architecture for vehicle monitoring system,” in 2016 IEEE 3rd World Forum on Internet of Things (WF-IoT), pp. 639–642, Reston, VA, USA, 2016.
[14] B. Zhou, M. Egele, and A. Joshi, “High-performance low-energy implementation of cryptographic algorithms on a programmable soc for iot devices,” in 2017 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6, Waltham, MA, USA, 2017.
[15] Xilinx Inc, Microzed industrial iot starter kit, April 2020, http://zedboard.org/product/microzediiotstarterkit.
[16] V. S. Miller, “Use of elliptic curves in cryptography,” H. C. Williams, Ed., pp. 417–426, Springer.
[17] N. Koblitz, “Elliptic curve cryptosystems,” Mathematics of Computation, vol. 48, no. 177, pp. 203–209, 1987.
[18] P. Szczechowiak and M. Collier, “Tinyibe: identity-based encryption for heterogeneous sensor networks,” in 2009 International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), pp. 319–354, Melbourne, VIC, Australia, 2009.
[19] N. Oualha and K. T. Nguyen, “Lightweight attribute-based encryption for the internet of things,” in 2016 25th International Conference on Computer Communication and Networks (ICCCN), pp. 1–6, Waikoloa, HI, USA, 2016.
[20] Z. U. A. Khan and M. Benaissa, “High-speed and low-latency ecc processor implementation over GF(2^{m}) on fpga,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 1, pp. 165–176, 2017.
[21] J. López and R. Dahab, “Fast multiplication on elliptic curves over GF(2^{m}) without precomputation,” pp. 316–327, Springer-Verlag.
[22] D. Karaklajic, J. Fan, J. Schmidt, and I. Verbauwhede, “Low-cost fault detection method for ecc using montgomery powering ladder,” 2011 Design, Automation Test in Europe, pp. 1–6, 2011.
[23] D. Dinu, Y. Le Corre, D. Khovratovich, L. Perrin, J. Großschädl, and A. Biryukov, “Triathlon of lightweight block ciphers for the internet of things,” Journal of Cryptographic Engineering, vol. 9, pp. 1–20, 2015.
[24] J.-L. Beuchat, T. Miyoshi, Y. Oyama, and E. Okamoto, “Multiplication over F_{p^{m}} on fpga: A survey,” in Reconfigurable Computing: Architectures, Tools and Applications, P. C. Diniz, E. Marques, K. Bertels, M. M. Fernandes, and J. M. P. Cardoso, Eds., pp. 214–225, Springer, Berlin, Heidelberg, 2007.
[25] G. Bertoni, J. Guajardo, S. Kumar, G. Orlando, C. Paar, and T. Wollinger, “Efficient GF(p^{m}) arithmetic architectures for cryptographic applications,” in Topics in Cryptology – CT-RSA 2003, M. Joye, Ed., pp. 158–175, Springer, Berlin, Heidelberg, 2003.
[26] C. Shu, S. Kwon, and K. Gaj, “Fpga accelerated tate pairing based cryptosystems over binary fields,” in 2006 IEEE International Conference on Field Programmable Technology, pp. 173–180, Bangkok, Thailand.
[27] L. Song and K. K. Parhi, “Low-energy digit-serial/parallel finite field multipliers,” Journal of VLSI signal processing systems for signal, image and video technology, vol. 19, no. 2, pp. 149–166, 1998.
[28] D. Pamula and E. Hrynkiewicz, “Area-speed efficient modular architecture for GF(2^{m}) multipliers dedicated for cryptographic applications,” in 2013 IEEE 16th International Symposium on Design and Diagnostics of Electronic Circuits Systems (DDECS), pp. 30–35, Karlovy Vary, Czech Republic, 2013.
[29] National Institute of Standards and Technology, Digital Signature Standard (DSS), Appendix D, Recommended Elliptic Curves for Federal Government Use, 1999, https://csrc.nist.gov/csrc/media/publications/fips/186/3/archive/20090625/documents/fips_1863.pdf.
[30] D. Hankerson, A. J. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2003.
[31] D. Pamula, Arithmetic operators on GF(2^{m}) for cryptographic applications: performance - power consumption - security tradeoffs, [Ph.D. thesis], Université Rennes 1, 2012, https://tel.archivesouvertes.fr/tel00767537.
[32] G. D. Sutter, J. Deschamps, and J. L. Imana, “Efficient elliptic curve point multiplication using digit-serial binary field operations,” IEEE Transactions on Industrial Electronics, vol. 60, no. 1, pp. 217–225, 2013.
[33] A. P. Fournaris and O. Koufopavlou, “Low area elliptic curve arithmetic unit,” in 2009 IEEE International Symposium on Circuits and Systems, pp. 1397–1400, Taipei, Taiwan, 2009.
[34] M. Morales-Sandoval and A. Diaz-Perez, “Area/performance evaluation of digit-digit GF(2^{k}) multipliers on fpgas,” in 23rd International Conference on Field programmable Logic and Applications, Porto, Portugal, 2013.
[35] I. San and A. Nuray, “Improving the computational efficiency of modular operations for embedded systems,” Journal of Systems Architecture, vol. 60, no. 5, pp. 440–451, 2014.
[36] B. Bengherbia, M. O. Zmirli, A. Toubal, and A. Guessoum, “Fpga-based wireless sensor nodes for vibration monitoring system and fault diagnosis,” Measurement, vol. 101, pp. 81–92, 2017.
[37] M. S. Hossain, E. Saeedi, and Y. Kong, “High-speed, area-efficient, fpga-based elliptic curve cryptographic processor over nist binary fields,” in 2015 IEEE International Conference on Data Science and Data Intensive Systems, pp. 175–181, Sydney, NSW, Australia, 2015.
[38] Z. Khan and M. Benaissa, “Throughput/area-efficient ecc processor using montgomery point multiplication on fpga,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 11, pp. 1078–1082, 2015.
[39] W. Wei, L. Zhang, and C. Chang, “A modular design of elliptic-curve point multiplication for resource constrained devices,” in 2014 International Symposium on Integrated Circuits (ISIC), pp. 596–599, Singapore, Singapore, 2014.
[40] M. N. Hassan and M. Benaissa, “Low area-scalable hardware/software codesign for elliptic curve cryptography,” in 2009 3rd International Conference on New Technologies, Mobility and Security, pp. 1–5, Cairo, Egypt, 2009.
[41] S. S. Roy, C. Rebeiro, and D. Mukhopadhyay, “Theoretical modeling of elliptic curve scalar multiplier on lut-based fpgas for area and speed,” vol. 21, no. 5, pp. 901–909, 2013.
Copyright
Copyright © 2021 Miguel Morales-Sandoval et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.