Abstract

In the emerging IoT ecosystem, in which internetworking will reach a totally new dimension, the crucial role of efficient security solutions for embedded devices is beyond dispute. Typically, IoT-enabled devices are equipped with integrated circuits, such as ASICs or FPGAs, to perform highly specific tasks. Such devices must implement cryptographic layers and must be able to access cryptographic functions for encrypting/decrypting and signing/verifying data using various algorithms, and to generate true random numbers, random primes, and cryptographic keys. Because typical IoT devices offer only limited resources, owing to energy efficiency requirements, hardware structures that are efficient in terms of time, area, and power consumption must be deployed. In this paper, we describe a scalable, word-based, multivendor-capable cryptographic core that is able to perform arithmetic operations in prime and binary extension finite fields based on Montgomery Arithmetic. The functional range comprises the calculation of modular additions and subtractions, the determination of the Montgomery Parameters, and the execution of Montgomery Multiplications and Montgomery Exponentiations. A prototype implementation of the adaptable arithmetic core is detailed. Furthermore, the decomposition of cryptographic algorithms into operations to be used together with the proposed core is described, and a performance analysis is given.

1. Introduction

The next generation of embedded systems and IoT devices will exhibit a much higher degree of internetworking, which gives rise to security considerations [1]. As a logical consequence, such devices must become cryptographic nodes that are capable, among other things, of encrypting/decrypting and signing/verifying data as well as establishing spontaneous secured communications by exchanging common secrets used for secret key calculation. While many embedded chips already support hardware-accelerated symmetric algorithms (mainly AES) [2] and hash functions, they lack hardware support for a wide range of public-key and key exchange algorithms with different precision widths, for reasons such as complexity, space, and cost. Besides, many modern cryptographic primitives require the capability to produce true random numbers and random prime numbers. Furthermore, typical IoT devices very often exhibit only a limited amount of resources, which requires cryptographic hardware structures that are efficient in terms of area, power consumption, and calculation performance [3]. In general, enterprises developing IoT products have three options for including application functionality in highly integrated devices: Application Specific Standard Products (ASSP), Application Specific Integrated Circuits (ASIC), or Field Programmable Gate Arrays (FPGA). Today FPGAs have become promising components for IoT applications [4]: ASSP solutions often cannot provide the required functionality, and FPGAs can provide a better Total Cost of Ownership (TCO) than ASIC solutions. Thus, for devices which are equipped with an FPGA, it is valuable to examine how efficient hardware structures for performing cryptographic operations can be included.

With regard to algorithm agility, an arithmetic engine with a minimal hardware footprint, which can handle the arithmetic operations of a great variety of cryptographic algorithms, is of great importance for IoT-based devices. In particular, the execution time of the individual operations should be predictable, so that lower and upper bounds on the calculation time can be given.

This paper proposes a small-footprint, vendor-neutral cryptographic arithmetic core, implemented as an example in FPGA logic. For efficiency, Montgomery Arithmetic is used for time-intensive modular operations such as multiplication and exponentiation. Without the need for any expensive software precalculations, the core is able to perform a large number of cryptographic algorithms and to handle various key sizes by simply processing operation lists. Furthermore, the core architecture is unified and can perform calculations in both prime finite fields (GF(p)) and binary extension fields (GF(2^m)). To illustrate the versatility of the developed core, well-established cryptographic algorithms have been rewritten and fragmented into operation lists to be processed by the arithmetic engine.

The paper is organized as follows. Section 2 reviews related work. Section 3 presents the design of the proposed Enhanced Montgomery Multiplication Core; the specified functional range of the core is given in Section 4. Section 5 describes some exemplary applications of the core, and Section 6 reports the results of the performance analysis. Finally, Section 7 concludes the paper.

2. Related Work

The efficiency of cryptographic algorithms implemented on reconfigurable hardware is mainly determined by how the underlying finite field arithmetic operations are realized [5]. Several applications in cryptography, such as encryption and decryption with asymmetric algorithms, the creation and verification of digital signatures, and secure key exchange mechanisms, make extensive use of the basic finite field modular arithmetic operations addition, multiplication, and the calculation of the multiplicative inverse. The field multiplication operation in particular is crucial to the efficiency of a design, since it is the core operation of many cryptographic algorithms [6].

In [7] P. L. Montgomery introduced a representation of residue classes in order to speed up modular multiplications without affecting modular additions and subtractions. Over the years numerous designs have been proposed implementing modular multiplications based on Montgomery’s multiplication algorithm [8]. The foundation for these architectures was presented by A. Tenca and Ç. Koç in [9]. The architecture is based on a word-based Montgomery Multiplication algorithm for prime finite fields in which multiplications are performed in a bit-serial fashion. E. Savaş et al. in [10] have proposed an extension which, in addition to the standard integer modulo arithmetic, also allows polynomial computations over binary finite fields. An overview of algorithms and hardware architectures for Montgomery Multiplication can be found in [11]. Optimizations of the original design have been proposed concerning the hardware implementation of the Montgomery Multiplication algorithm [12] as well as the use of the special arithmetic hard-core extensions that FPGAs provide to accelerate digital signal processing applications [13]. Some designs focus only on utilizing the Montgomery Multiplication method to accelerate modular exponentiation operations as required by the RSA algorithm [14, 15].

However, to the best of our knowledge, no publication focuses on how the Montgomery Multiplication architecture can be embedded into a comprehensive solution. In this paper we propose an enhanced version of a bit-serial, word-based, unified Montgomery Multiplication core based on logic elements only. The core is controlled by a state machine and offers the functional range needed to perform complete cryptographic algorithms without additional complex processing in software.

3. Enhanced Montgomery Multiplication Core

3.1. Requirements

Today a large number of different public-key algorithms are in use. To ensure compatibility, cryptographic applications must support a large portion of those algorithms. While typical software implementations can often easily be upgraded in order to adopt new algorithms and larger key sizes, the same is not necessarily true for hardware implementations. Therefore, the following requirements have been identified for the Enhanced Montgomery Multiplication Core:

(i) Use of Montgomery Arithmetic. The design must be able to perform modulo operations in a time-efficient manner by using Montgomery Arithmetic. At a minimum, the core must support Montgomery Multiplications and Montgomery Exponentiations. Furthermore, the core must support standard modulo additions and modulo subtractions.

(ii) Works on Both Finite Fields GF(p) and GF(2^m). The architecture must exhibit a unified structure supporting both standard integer modulo operations of prime finite fields and polynomial calculations of binary finite fields.

(iii) Montgomery Parameter Calculation. In general, the Montgomery Parameters (R and R^2) can be precomputed for previously known moduli. However, as a requirement the core must be able to handle arbitrary moduli. Therefore it must be capable of calculating these Montgomery Parameters without the need for precalculations done in software.

(iv) Scalable Design. The architecture must be scalable in terms of timing, area, and power consumption. This includes the parametrisation of the word width, the internal storage size, and the number of processing units within the pipeline.

(v) Multialgorithm Support. The core must be based on a building-block design. The functional range provided by the arithmetic unit should enable algorithm agility by fragmenting cryptographic algorithms into a list of core operations. At a minimum, the core must be capable of performing RSA [16] operations, (safe) prime number generation and primality testing (MR) [17, 18], key exchange operations (DH) [19], and elliptic curve calculations (EC) [20] over both prime and binary finite fields.

(vi) Supporting as Many Precision Widths as Possible. The design must support a wide range of different precision widths determining the security level of the cryptographic algorithm. If a certain security level becomes inadequate because of increased computing power available to attackers, the precision width can be adjusted accordingly, which makes the hardware less prone to becoming obsolete due to higher security demands. The core must support the current recommendations for minimum key sizes [21] and should also support larger key sizes. For RSA and Diffie-Hellman key exchange support the architecture should be able to handle precisions up to 4096-bit moduli; for elliptic curve cryptography support, precisions up to 512 bits for prime finite fields and up to 571 bits for binary finite fields should be possible.

(vii) Time-Invariant Operations. The architecture must be capable of performing its operations in a time-invariant manner. If security-sensitive information, such as private keys, is processed, it must be ensured that all operations exhibit the same execution time in order to prevent side-channel attacks based on timing analysis.

3.2. Overall Core Architecture

Figure 1 illustrates the overall architecture of the proposed Enhanced Montgomery Multiplication Core which is capable of meeting all requirements as specified above.

Besides the pipeline of processing units handling the main part of the word-based Montgomery Multiplication algorithm, the core features an enhanced word-based Carry Look-Ahead (CLA) adder, which is responsible for calculating the final result after the pipeline has processed all bits of an operand as well as for performing single modular addition and subtraction operations. The register files of the original design have been replaced with an internal dual-ported RAM which holds the operands as well as intermediate results of the core operations. Furthermore, a word-based comparator component has been added which is queried during operations to decide whether a modular addition or subtraction step must be performed. Two additional words, each as wide as the RAM, have been introduced for the operand and the exponent; they are fetched from RAM in the case of Montgomery Multiplication and Montgomery Exponentiation operations. An auxiliary word is used for RAM reorganisation operations as well as for the calculation of the Montgomery Parameters.

The intelligence of the core is the controlling state machine, which utilizes the defined components to perform standard modular addition and subtraction operations, Montgomery Multiplications, Montgomery Exponentiations, Montgomery Parameter calculation, and RAM reorganisation operations. It is therefore responsible for controlling the RAM write and read access, the source and destination address signals of the RAM, as well as the values passed to the first processing unit, to the CLA adder, and to the comparator component. Furthermore, it controls the assignments of the operand, exponent, and auxiliary words.

The described core can be parametrised in three ways. The parameter named MAX_PRECISION_WIDTH specifies the highest supported precision width, whereas the parameter WORD_WIDTH is used to specify the word width of the operands involved in the calculations. These two parameters determine the size and the address space of the internal core RAM. The third parameter, MAX_NUM_PUS, specifies the maximum number of processing units of the pipeline implemented for a specific core variation, which mainly affects the performance and the size in terms of area consumption.

3.2.1. Processing Units

The heart of the core is the pipeline of processing units implementing the multiple-word version of the Montgomery Multiplication algorithm. Therefore the processing unit structure has been described from scratch. The processing unit can be held in reset and keeps track of the cycle number according to the number of words to be processed, depending on the supplied parameters. This control logic is needed to determine whether or not the supplied modulus has to be added to the processed words in the current cycle, depending on the value of the signal denoting an odd intermediate result. Note that buffering the output of a processing unit between two processing units is not required in this design. Compared to the original design presented in [10], the number of words required for a unified solution depends on the given precision width and word size, and the pipeline must consist of a power-of-two number of processing units, limited by a maximum number, in order to avoid pipeline stalls. Figure 2 illustrates the internal architecture of an exemplary processing unit.

Each processing unit consists of a cascade of two layers of so-called Unified Full Adder (UFA) cells. The Unified Full Adder cells basically consist of simple full adder cells which have been enhanced by an additional finite field selection input. This allows for the creation of a unified multiplier architecture which can be used not only in prime fields (GF(p)) but also in binary fields (GF(2^m)), in which additions are simple bitwise XOR calculations without any carry output.
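The behaviour of a single Unified Full Adder cell can be sketched in a few lines of Python. The function name `ufa`, the argument names, and the polarity of the field selection input (1 selects prime field arithmetic, 0 selects binary field arithmetic) are assumptions made for illustration; the source does not state them.

```python
def ufa(a, b, c_in, fsel):
    """Sketch of a Unified Full Adder cell operating on single bits.

    fsel = 1 (assumed polarity): ordinary full adder, as needed for GF(p).
    fsel = 0: the carry output is suppressed, so the cell degenerates to a
    bitwise XOR, which is addition in GF(2^m).
    """
    s = a ^ b ^ c_in
    # The majority function produces the carry; it is gated by fsel.
    c_out = ((a & b) | (a & c_in) | (b & c_in)) & fsel
    return s, c_out
```

In this sketch the sum output is identical in both modes; only the carry output is gated by the field selection input.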

3.2.2. Carry Look-Ahead Adder

Since the pipeline generates the result in carry-save form, an additional step is necessary at the end of each calculation to obtain a nonredundant version of the result. For the sake of uniformity, a circuit is required that can operate in both finite fields GF(p) and GF(2^m). Furthermore, since the calculation in GF(p) could require one further subtraction step, the Carry Look-Ahead adder in the design has been formulated to be able to perform word-based modular additions and subtractions. Figure 3 illustrates the logic of the proposed enhanced CLA adder of the core.

The internal signal of the second operand is derived from the second input value and an add-or-subtract signal, which selects between addition and subtraction (the latter being performed as an addition in two’s complement representation). The modified CLA adder uses the common Carry Look-Ahead adder logic for the calculation of the generate and propagate functions, from which the internal carries, the final sum output bits, and the carry output bit are determined. If the selected finite field is GF(2^m), the add-or-subtract input is ignored, the final sum is simply the bitwise modulo-2 addition of the two input values, and the carry output bit is forced to zero.
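The following Python sketch models the functional behaviour of one word of such an adder. It uses the textbook generate/propagate equations (g_i = a_i AND b_i, p_i = a_i XOR b_i, c_{i+1} = g_i OR (p_i AND c_i), s_i = p_i XOR c_i), which are an assumption here because the exact equations of the source are not reproduced. The name `cla_word`, its parameters, and the polarity of `fsel` are hypothetical, and the carries are evaluated in a loop rather than in parallel as real look-ahead logic would.

```python
def cla_word(a, b, c_in, sub, fsel, w=8):
    """Add or subtract two w-bit words; returns (sum_word, carry_out).

    sub  = 1 inverts the second operand; for the least significant word the
           caller passes c_in = 1 to complete the two's complement.
    fsel = 1 (assumed polarity) selects GF(p), fsel = 0 selects GF(2^m).
    """
    mask = (1 << w) - 1
    a &= mask
    b &= mask
    if not fsel:
        # Binary field: plain XOR, add/subtract input ignored, carry forced to 0.
        return a ^ b, 0
    b_eff = b ^ (mask if sub else 0)   # conditional bitwise inversion
    g = a & b_eff                      # generate bits
    p = a ^ b_eff                      # propagate bits
    carry = c_in & 1
    s = 0
    for i in range(w):
        p_i = (p >> i) & 1
        g_i = (g >> i) & 1
        s |= (p_i ^ carry) << i
        carry = g_i | (p_i & carry)
    return s, carry
```

For a subtraction, a carry output of 1 from the most significant word indicates a non-negative result, while 0 indicates that the result is negative.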

3.2.3. Core RAM Structure

The RAM of the core must be capable of holding all the necessary operands and intermediate values required during the execution of cryptographic algorithms. The basic structure of the described RAM is pictured in Figure 4.

It features four symbolic horizontal RAM operand locations with MAX_PRECISION_WIDTH bit each, which are organized as eight pieces of equal size. One location is intended to hold an operand in Montgomery Multiplication and Montgomery Exponentiation operations; a second location is intended to hold the modulus. A third location usually holds the temporary sum value during Montgomery Multiplications and Montgomery Exponentiations or the first operand in modular addition or subtraction operations. The fourth location usually holds the temporary carry stream during Montgomery Multiplications and Montgomery Exponentiations or the second operand in modular addition or subtraction operations.

Besides the horizontal RAM operand locations, three symbolic vertical RAM operand locations with MAX_PRECISION_WIDTH bit each have been defined, which are likewise organized as eight pieces of equal size. For convenience, these locations are usually used to hold the other operand in Montgomery Multiplication and Montgomery Exponentiation operations as well as the exponent operand and the auxiliary operand in Montgomery Exponentiation operations. In addition, all RAM slots are intended to hold intermediate values during the execution of cryptographic algorithms.

4. Functional Range of the Core

This section provides a description of the functional range of the proposed core. The following precisions (denoted in bit-length) are supported:

(i) EC over GF(p), RSA, MR, DH: 192, 224, 256, 320, 384, 448, 512, 768, 1024, 1536, 2048, 3072, 4096

(ii) EC over GF(2^m): 131, 163, 176, 191, 193, 208, 233, 239, 272, 283, 304, 359, 368, 409, 431, 571

If further or other precision widths should be supported, the described core can easily be adjusted accordingly. For the parametrisation and the execution/abortion of an operation, a command input word has been defined. Besides the start, abort, and finite field selection signals, the encoded precision width, the operation code, and the RAM offsets for the specified operation can also be supplied. The following operations have been specified.

4.1. MontMult Operation

The MontMult operation code instructs the core to perform a single Montgomery Multiplication with the supplied elements in the given finite field. A Montgomery Multiplication starts by reading the first word of the bit-serially processed operand from RAM. Afterwards the pipeline is started and the appropriate bits of this operand are fed to the individual processing units. Once all bits of the operand word have been fed to the processing units, a new word is read from RAM. Once the last bit of the operand has been processed, the temporary sum and temporary carry words are fed into the CLA adder in order to reunite the two streams. After the last words of temporary sum and temporary carry have been brought together, the carry output bit of the CLA adder is evaluated. If the carry bit is set, the modulus is subtracted once; otherwise the result is compared to the given modulus. If the result is equal to or greater than the modulus, the modulus is subtracted once.
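As a functional reference for the prime field case, the bit-serial Montgomery Multiplication can be sketched on Python integers as follows. This is a simplified model, not the core's word-based carry-save data path: the intermediate value is kept as one integer, and the final reduction is a single conditional subtraction. The names `mont_mul`, `a`, `b`, `n`, and `k` are chosen for illustration, and the iteration count of exactly `k` bits is an assumption.

```python
def mont_mul(a, b, n, k):
    """Return a * b * 2^(-k) mod n for odd n < 2^k and a, b < n.

    The operand b is consumed one bit per iteration (bit-serial), starting
    with the least significant bit.
    """
    s = 0
    for i in range(k):
        if (b >> i) & 1:
            s += a          # add the multiplicand when the current bit is set
        if s & 1:
            s += n          # odd intermediate result: add the modulus
        s >>= 1             # exact division by 2
    if s >= n:
        s -= n              # final correction step
    return s


# Usage: with R = 2^k, the product of two montgomerized values stays montgomerized.
n, k = 97, 7
R = 1 << k
x, y = 12, 34
x_m, y_m = (x * R) % n, (y * R) % n
product_m = mont_mul(x_m, y_m, n, k)
```

Because the intermediate value stays below twice the modulus for inputs smaller than the modulus, one conditional subtraction at the end is sufficient in this model.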

4.2. MontR Operation

The MontR operation code instructs the core to calculate the Montgomery Parameter R mod n for a supplied modulus n in the given finite field, with k being the bit-length of the given precision.

In the case of prime field arithmetic the Montgomery Parameter will be R = 2^k, so R mod n can be calculated as the two’s complement of n, that is, as the bitwise inverse of the given modulus plus 1. Therefore the individual words of the modulus will be XOR-ed with a constant word consisting of all ones. In addition, the least-significant bit of the first word will be set to one.

In the case of binary field arithmetic the Montgomery Parameter will be R = x^k, so R reduced by the irreducible polynomial is equal to the binary expression of the irreducible polynomial with the most significant bit set to zero. Therefore the individual words of the modulus will be scanned and the appropriate most significant bit will be set to zero, depending on the given precision.
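Both variants can be illustrated with a short Python sketch. The function names, the word width `w`, and the assumption that the prime field modulus occupies the full precision (its most significant bit at position k - 1 is set, so that 2^k mod n equals 2^k - n) are made for illustration only.

```python
def mont_r_prime(n, k, w=32):
    """R mod n with R = 2^k, computed word by word as the two's complement of n.

    Assumes n is odd, k is a multiple of w, and 2^(k-1) < n < 2^k.
    """
    mask = (1 << w) - 1
    words = [(n >> (w * i)) & mask for i in range(k // w)]
    inverted = [word ^ mask for word in words]   # XOR with an all-ones word
    # n is odd, so the inverted value is even; setting the LSB equals adding 1.
    inverted[0] |= 1
    return sum(word << (w * i) for i, word in enumerate(inverted))


def mont_r_binary(f, k):
    """x^k mod f(x) for an irreducible polynomial f of degree k (bit-vector form)."""
    return f & ~(1 << k)                         # clear the most significant bit


# Usage examples
r_p = mont_r_prime(0xF123456789ABCDEF, 64)
r_b = mont_r_binary(0x11B, 8)                    # f(x) = x^8 + x^4 + x^3 + x + 1
```

Neither variant needs a multiplication or a division, which matches the description of simple word-wise XOR and bit manipulation steps.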

4.3. MontR2 Operation

The MontR2 operation code instructs the core to calculate the Montgomery Parameter R^2 mod n for a supplied modulus n in the given finite field, with k being the bit-length of the given precision.

In the case of prime field arithmetic the Montgomery Parameter will be given by R^2 mod n with R = 2^k. Therefore, in a first step, the Montgomery Parameter R mod n will be calculated for prime fields as described above. In order to calculate R^2 mod n, one possible way is to calculate 2^d · R mod n with d being a small divider of k and to raise this value to the power k/d in the Montgomery Domain. In the given implementation d = 1. Therefore the bits of R mod n will be shifted to the left by one bit. If the result is equal to or greater than the modulus, n will be subtracted once. By using a square-and-multiply-like algorithm, multiple Montgomery Multiplications will then be performed in order to calculate R^2 mod n.

In the case of binary field arithmetic the Montgomery Parameter will be given by R^2 reduced by the irreducible polynomial, with R = x^k. Therefore, in a first step, the Montgomery Parameter R will be calculated for binary fields as described above. In order to calculate R^2, the resulting parameter will be shifted k times bitwise to the left. After each shift, the most significant bit as given by the precision parameter will be evaluated. If the bit is one, the irreducible polynomial will be added to the intermediate result, which represents a modulo reduction with the irreducible polynomial. Once the shift has been performed k times, the result will be R^2 reduced by the irreducible polynomial.
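A minimal Python sketch of both procedures is given below. The prime field variant follows one plausible reading of the description: R mod n is doubled once, and the doubled value (the Montgomery form of 2) is raised to the k-th power using Montgomery Multiplications, which yields 2^k · R = R^2 mod n. All function names are hypothetical, `mont_mul` is a simplified integer model, and the prime field modulus is assumed to occupy the full precision.

```python
def mont_mul(a, b, n, k):
    """Simplified bit-serial Montgomery Multiplication: a * b * 2^(-k) mod n."""
    s = 0
    for i in range(k):
        if (b >> i) & 1:
            s += a
        if s & 1:
            s += n
        s >>= 1
    return s - n if s >= n else s


def mont_r2_prime(n, k):
    """R^2 mod n with R = 2^k, assuming n odd and 2^(k-1) < n < 2^k."""
    r = (1 << k) - n                 # R mod n (two's complement of n)
    x = r << 1                       # shift left by one bit: 2R
    if x >= n:
        x -= n                       # conditional subtraction: 2R mod n
    acc = x
    for bit in bin(k)[3:]:           # remaining exponent bits, MSB first
        acc = mont_mul(acc, acc, n, k)
        if bit == "1":
            acc = mont_mul(acc, x, n, k)
    return acc                       # equals 2^k * R mod n = R^2 mod n


def mont_r2_binary(f, k):
    """x^(2k) mod f(x) for an irreducible polynomial f of degree k."""
    r = f ^ (1 << k)                 # x^k mod f
    for _ in range(k):
        r <<= 1                      # multiply by x
        if (r >> k) & 1:
            r ^= f                   # reduce: add the irreducible polynomial
    return r
```

For example, `mont_r2_prime(97, 7)` reproduces `pow(128, 2, 97)`, and `mont_r2_binary(0x11B, 8)` gives the bit-vector of x^16 reduced by x^8 + x^4 + x^3 + x + 1.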

4.4. MontExp Operation

The MontExp operation code instructs the core to perform a Montgomery Exponentiation consisting of multiple Montgomery Multiplication steps in the given finite field. A Montgomery Exponentiation starts by reading the first word of the exponent from RAM. Afterwards, the first set bit of the exponent word is searched for, starting from the most significant bit. If the first word consists of all zeros, the next word of the exponent is read and evaluated. Once the highest set bit of the exponent has been found, multiple Montgomery Multiplications are performed following a square-and-multiply algorithm until all bits of the exponent have been processed.
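The left-to-right square-and-multiply procedure described above can be sketched as follows for the prime field case. The names `mont_exp` and `mont_mul` are illustrative, and the sketch is deliberately simple: it multiplies only when an exponent bit is set and is therefore not time-invariant, unlike what the requirements demand of the hardware.

```python
def mont_mul(a, b, n, k):
    """Simplified bit-serial Montgomery Multiplication: a * b * 2^(-k) mod n."""
    s = 0
    for i in range(k):
        if (b >> i) & 1:
            s += a
        if s & 1:
            s += n
        s >>= 1
    return s - n if s >= n else s


def mont_exp(x_m, e, n, k):
    """Return the Montgomery form of x^e, given the Montgomery form x_m of x.

    Assumes e >= 1. Scanning starts below the highest set bit of e, which
    corresponds to initialising the accumulator with x_m itself.
    """
    acc = x_m
    for i in range(e.bit_length() - 2, -1, -1):
        acc = mont_mul(acc, acc, n, k)        # square
        if (e >> i) & 1:
            acc = mont_mul(acc, x_m, n, k)    # multiply
    return acc


# Usage: 5^13 mod 97, with the result transformed back by a multiplication with 1
n, k = 97, 7
x_m = (5 * (1 << k)) % n
result = mont_mul(mont_exp(x_m, 13, n, k), 1, n, k)
```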

4.5. ModAdd Operation

The ModAdd operation code instructs the core to perform a modular addition of the supplied elements in the given finite field. After preparing the core for the addition operation, the CLA adder will add the given operands using the appropriate arithmetic given by the finite field selection input. Once the last words of the given operands have been added the carry output bit of the CLA adder will be evaluated. If a carry bit is set, the modulus will be subtracted once; otherwise the result will be compared to the given modulus. If the result is equal to or greater than the modulus, it will also be subtracted once.

4.6. ModSub Operation

The ModSub operation code instructs the core to perform a modular subtraction of the supplied elements in prime fields. After preparing the core for the subtraction operation the CLA adder will be used to perform a word-based subtraction by performing an addition in two’s complement representation with prime field arithmetic. After the last words of the given operands have been processed, the carry output bit of the CLA adder will be evaluated. If the carry bit signals a negative result, the modulus will be added once; otherwise the result will be compared to the given modulus. If the result is equal to or greater than the modulus, it will be subtracted once.
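The decision logic of the ModAdd and ModSub operations can be modelled on k-bit integers as shown below. The function names and the `gf2` flag are hypothetical, and the word-by-word processing of the CLA adder is collapsed into single integer operations.

```python
def mod_add(a, b, n, k, gf2=False):
    """Modular addition of a, b < n < 2^k, mirroring the described carry/compare steps."""
    if gf2:
        return a ^ b                      # binary field: bitwise XOR, no reduction
    mask = (1 << k) - 1
    total = a + b
    carry, s = total >> k, total & mask
    if carry:
        s = (s - n) & mask                # carry set: subtract the modulus once
    elif s >= n:
        s -= n                            # result not below modulus: subtract once
    return s


def mod_sub(a, b, n, k):
    """Modular subtraction in GF(p) via two's complement addition."""
    mask = (1 << k) - 1
    total = a + ((~b) & mask) + 1         # a + two's complement of b
    carry, d = total >> k, total & mask
    if not carry:
        d = (d + n) & mask                # no carry: negative result, add modulus
    elif d >= n:
        d -= n
    return d
```

With a modulus of 97 and a precision of 7 bits, `mod_add(90, 50, 97, 7)` yields 43 and `mod_sub(10, 20, 97, 7)` yields 87.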

4.7. RAM Copy Operations

In order to support cryptographic algorithms which have been disassembled into a list of instructions, RAM copy operations are needed. According to the proposed RAM layout stated above four individual copy operations have been defined.

The CopyH2V operation code instructs the core to copy a number of words, according to the supplied precision parameter, from the horizontal RAM layout starting from the given source address to the vertical RAM layout starting from the given destination address.

The CopyV2V operation code instructs the core to copy a number of words, according to the supplied precision parameter, from the vertical RAM layout starting from the given source address to the vertical RAM layout starting from the given destination address.

The CopyH2H operation code instructs the core to copy a number of words, according to the supplied precision parameter, from the horizontal RAM layout starting from the given source address to the horizontal RAM layout starting from the given destination address.

The CopyV2H operation code instructs the core to copy a number of words, according to the supplied precision parameter, from the vertical RAM layout starting from the given source address to the horizontal RAM layout starting from the given destination address.

4.8. MontMult1 Operation

The MontMult1 operation code instructs the core to perform a single Montgomery Multiplication of the supplied element with the constant 1 in the given finite field. This type of operation is needed when a montgomerized value is to be transformed back from the Montgomery Domain. It has been implemented as an independent operation since an operand holding the constant 1 would otherwise unnecessarily occupy a vertical RAM slot. A Montgomery Multiplication with the constant 1 is executed analogously to the MontMult operation, with the only exception that constant words are used for the operand instead of the RAM words.
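Why a multiplication with the constant 1 leaves the Montgomery Domain can be seen from a small Python sketch: a montgomerized value has the form x · R mod n, and a Montgomery Multiplication returns the product of its inputs times R^(-1), so multiplying by 1 removes the factor R. The helper names are illustrative, and `mont_mul` is a simplified integer model for the prime field case.

```python
def mont_mul(a, b, n, k):
    """Simplified bit-serial Montgomery Multiplication: a * b * 2^(-k) mod n."""
    s = 0
    for i in range(k):
        if (b >> i) & 1:
            s += a
        if s & 1:
            s += n
        s >>= 1
    return s - n if s >= n else s


def from_montgomery(x_m, n, k):
    """Back-transformation: (x * R) * 1 * R^(-1) mod n = x mod n."""
    return mont_mul(x_m, 1, n, k)


# Usage
n, k = 97, 7
x = 42
x_m = (x * (1 << k)) % n
x_back = from_montgomery(x_m, n, k)
```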

5. Exemplary Core Application Descriptions

This section gives exemplary descriptions of how the specified functional range of the proposed building-block Enhanced Montgomery Multiplication Core design can be utilized to support a wide range of cryptographic algorithms, demanding the least possible memory capacity while at the same time supporting as many precision widths as possible. It is shown how to perform Chinese Remainder Theorem [22] (CRT) accelerated RSA private key operations and how to use the core in order to test/generate prime numbers.

For the support of elliptic curve cryptography over prime and binary finite fields, modular functions are given for preparing and conducting point operations for arbitrary elliptic curves for the supported precision widths. For all these algorithms, a list of operations and the quantity of the different operations are given, which allows cryptographic algorithms to be performed by simply processing these operation lists.

5.1. CRT-Accelerated RSA Operation

In order to speed up RSA private key operations, the CRT-accelerated version is also supported by the core. For this purpose, some operations have to be performed with full precision, whereas most of the operations have to be performed with half precision. Algorithm 1 lists the necessary steps to utilize the core for CRT-accelerated RSA private key operations.

Requires: (, , , , , , , )
Calculates: ()
(h);
(f);
(h);
(h);
(h);
(h);
(f);
(h);
(h);
(h);
(h);
(h);
(h);
(f);
(f);
(f);
(f);
Provides: ()

Table 1 illustrates the abstract operation lists of the core for the CRT-accelerated RSA application using the private key portion for all supported precision widths (512, 768, 1024, 1536, 2048, 3072, 4096). The number given in the index of the RAM locations denotes the offset given by the corresponding src_addr, dest_addr, src_addr_e, src_addr_x input signals. The width of the processed values depends on the supplied mwmac_precision input signal, which depends on the operation. In the table, operations requiring full precision (the precision of the RSA modulus) are marked by (f), and operations requiring half precision are marked by (h). The mwmac_f_sel signal must be set to GF(p) arithmetic.

CRT-accelerated RSA private key operations require , , , , , , , and performed on half precision and , , , and performed on full precision.
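For orientation, the mathematical structure behind a CRT-accelerated RSA private key operation can be sketched with Python's built-in modular arithmetic. This is not the core's operation list: it uses Garner's recombination, which is an assumed variant, and all names (`rsa_crt_private`, `dp`, `dq`, `q_inv`) are chosen for illustration. The two exponentiations modulo the prime factors are the half-precision part; the reduction of the input and the recombination involve full-precision values.

```python
def rsa_crt_private(c, p, q, dp, dq, q_inv):
    """CRT-accelerated RSA private key operation (Garner recombination)."""
    m_p = pow(c % p, dp, p)          # half precision: exponentiation modulo p
    m_q = pow(c % q, dq, q)          # half precision: exponentiation modulo q
    h = (q_inv * (m_p - m_q)) % p    # half precision: difference scaled by q^(-1) mod p
    return m_q + h * q               # full precision: recombined result


# Usage with a toy key (far too small for real use)
p, q = 61, 53
n = p * q
e, d = 17, 2753
dp, dq = d % (p - 1), d % (q - 1)
q_inv = pow(q, -1, p)
ciphertext = pow(65, e, n)
plaintext = rsa_crt_private(ciphertext, p, q, dp, dq, q_inv)
```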

5.2. Prime Generation/Testing Operation

Algorithm 2 lists the necessary steps to utilize the core, in conjunction with a TRNG, as a Miller-Rabin Primality Tester. The inputs of the algorithm are the random integer to be tested for primality and the confidence parameter determining the accuracy of the test, i.e., the number of Miller-Rabin loops. In a precomputation step, the candidate minus one must be decomposed into a power of two times an odd factor, which can be done by simple shift operations and counter increments in software. The test furthermore requires a number of random integers serving as random bases.

Precomputation: (, with )
Input: (, , ,   )
Output: (, composite or probably prime)
;
;
for     from 1 to    do
;
;
;
if  orthen
continue;
for   fromtodo
;
;
if    then
return ( composite);
;
if    then
continue;
return ( composite);
return ( probably prime)

Table 2 illustrates the operation list for utilizing the core for the Miller-Rabin Primality Test steps for all supported precision widths (192, 224, 256, 320, 384, 512, 768, 1024, 1536, 2048, 3072, 4096). The number given in the index of the RAM locations denotes the offset given by the corresponding src_addr, dest_addr, src_addr_e, src_addr_x input signals. The width of the processed values depends on the supplied mwmac_precision input signal. The mwmac_f_sel signal must be set to GF(p) arithmetic. Note that, since the results of the performed operations will be in the Montgomery Domain, they will be checked against the Montgomery Parameter R mod n and its negative n - (R mod n) instead of 1 and n - 1, with n denoting the integer under test. Also note that the random bases to be checked need not be transformed into the Montgomery Domain first; they are simply interpreted as random montgomerized values.

The total number of needed core operations depends on the security parameter and the value resulting from the factorization of . Within the outer for loop from writing a new to the RAM until the evaluation of    , , , , , and and until evaluation of    , and operations are required. Within the inner for loop until evaluation of updated   , , , and and until evaluation of updated , and operations is required.
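The following Python sketch shows a Miller-Rabin test carried out entirely in the Montgomery Domain, with the comparisons made against R mod n and n - (R mod n) as noted above. It is a simplified model rather than the core's operation list; the function names and the integer-based `mont_mul` are assumptions, and the random bases are passed in as already montgomerized values.

```python
def mont_mul(a, b, n, k):
    """Simplified bit-serial Montgomery Multiplication: a * b * 2^(-k) mod n."""
    s = 0
    for i in range(k):
        if (b >> i) & 1:
            s += a
        if s & 1:
            s += n
        s >>= 1
    return s - n if s >= n else s


def mont_exp(x_m, e, n, k):
    """Left-to-right square-and-multiply in the Montgomery Domain (e >= 1)."""
    acc = x_m
    for i in range(e.bit_length() - 2, -1, -1):
        acc = mont_mul(acc, acc, n, k)
        if (e >> i) & 1:
            acc = mont_mul(acc, x_m, n, k)
    return acc


def miller_rabin_mont(n, bases_m, k):
    """Return True if odd n < 2^k passes the test for all montgomerized bases."""
    one_m = (1 << k) % n             # Montgomery form of 1
    minus_one_m = n - one_m          # Montgomery form of n - 1
    d, s = n - 1, 0
    while d % 2 == 0:                # precomputation: n - 1 = 2^s * d, d odd
        d //= 2
        s += 1
    for a_m in bases_m:
        y = mont_exp(a_m, d, n, k)
        if y == one_m or y == minus_one_m:
            continue
        for _ in range(s - 1):
            y = mont_mul(y, y, n, k)
            if y == one_m:
                return False         # nontrivial square root of 1: composite
            if y == minus_one_m:
                break
        else:
            return False             # minus one never reached: composite
    return True
```

A result of True means "probably prime" with respect to the supplied bases only; the confidence grows with the number of independent random bases.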

5.3. Elliptic Curve Operations

Unlike modular exponentiation, which is based only on modular multiplications, elliptic curve Point Addition and Point Doubling operations involve modular additions, subtractions, and multiplications, even in the Jacobian projective coordinate representation [23]. The algorithms for prime field elliptic curve Point Addition and Point Doubling using Jacobian coordinates furthermore involve multiplications by some constants. Since the described core performs multiplication operations by using Montgomery Arithmetic, these constants must first be transformed into the Montgomery Domain so that the intermediate values remain montgomerized.

In order to utilize the core for elliptic curve operations, the following modular functions have been specified for both GF(p) and GF(2^m) support:

(i) EC Preparation.

(ii) EC Montgomery Transformation.

(iii) EC Affine-to-Jacobi Transformation.

(iv) EC Point Validation.

(v) EC Point Doubling.

(vi) EC Point Addition.

(vii) EC Jacobi-to-Affine Transformation.

(viii) EC Montgomery Backtransformation.

In the following, algorithms for utilizing the core to perform EC operations in GF(p) are stated. For GF(2^m) EC support, similar algorithms have been derived.

5.3.1. GF(p) EC Preparation

The prime field EC Preparation steps include the calculation of the Montgomery Parameters and the exponent as well as the montgomerized versions of the required constants and of the EC Domain Parameters for a given elliptic curve over GF(p). Algorithm 3 lists the necessary steps to utilize the core for EC prime field preparation.

Requires: (, , , , , , )
Calculates: (, , , , , , , )
;
;
;
;
;
;
;
;
Provides: (, , , , , , , )

A core prime field EC preparation operation requires , , , , and .

5.3.2. GF(p) EC Montgomery Transformation

The prime field EC Montgomery Transformation steps are responsible for transforming the supplied affine point coordinates into the Montgomery Domain: the coordinates of a single Point in the case of a Point Doubling or Point Multiplication operation, and the coordinates of both curve Points in the case of a Point Addition operation. Algorithm 4 lists the steps to utilize the core for prime field EC Montgomery Transformation for an arbitrary curve Point.

Requires: (, , )
Calculates: (, )
;
;
Provides: (, )

A core prime field EC Montgomery Transformation operation requires , , and in the case of an intended Point Doubling or Point Multiplication operation and and in the case of an intended Point Addition operation.

5.3.3. GF(p) EC Affine-to-Jacobi Transformation

The prime field EC Affine-to-Jacobi Transformation steps are responsible for transforming the supplied montgomerized affine point coordinates and of a curve Point into Jacobian coordinates. Algorithm 5 lists the necessary steps to utilize the core for prime field EC Affine-to-Jacobi Transformation for an arbitrary montgomerized curve Point .

Requires: (, )
Calculates: (, , )
;
;
;
Provides: (, , )

A core prime field EC Affine-to-Jacobi Transformation operation requires , , and in the case of an intended Point Addition, Point Doubling, or Point Multiplication operation.
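As a hedged illustration, the transformation amounts to appending a Z coordinate equal to the montgomerized value of 1, which is R mod p. The sketch below assumes R = 2^k; the function names are hypothetical and `mont_mul` is only a functional model of the core operation.

```python
def mont_mul(a, b, p, k):
    """Functional model of a Montgomery Multiplication: a * b * R^-1 mod p, R = 2**k."""
    return (a * b * pow(1 << k, -1, p)) % p


def affine_to_jacobi(x_m, y_m, p, k):
    """Extend montgomerized affine (x, y) to Jacobian (X, Y, Z) with Z = 1 in Montgomery form."""
    one_m = (1 << k) % p    # the montgomerized 1 equals R mod p
    return x_m, y_m, one_m


# Illustrative values: p = 23, R = 2^5.
X, Y, Z = affine_to_jacobi(4, 5, p=23, k=5)
```

Because Z is the montgomerized 1, multiplying any montgomerized value by Z with `mont_mul` leaves that value unchanged.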

5.3.4. GF(p) EC Point Validation

The prime field EC Point Validation checks whether a supplied (or calculated) point is indeed a valid point of the elliptic curve given by the equation $y^2 = x^3 + ax + b$ over GF(p). The Point Validation must be conducted on montgomerized points in affine coordinate representation. Algorithm 6 lists the necessary steps to utilize the core for prime field EC Point Validation.

Requires: (, , , , )
Output: (, point on curve or point not on curve)
;
;
;
;
;
;
;
if    then
return: ( point on curve);
else
return: ( point not on curve);

A core prime field EC Point Validation operation requires , , , , and .
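The check can be sketched in software as follows. This is an illustrative model under the assumption of a short Weierstrass curve y² = x³ + ax + b and a radix R = 2^k; the step order need not match Algorithm 6, and all names are hypothetical.

```python
def mont_mul(a, b, p, k):
    """Functional model of a Montgomery Multiplication: a * b * R^-1 mod p, R = 2**k."""
    return (a * b * pow(1 << k, -1, p)) % p


def is_on_curve_mont(x_m, y_m, a_m, b_m, p, k):
    """Check y^2 == x^3 + a*x + b using only montgomerized operands."""
    lhs = mont_mul(y_m, y_m, p, k)          # y^2
    x2 = mont_mul(x_m, x_m, p, k)           # x^2
    x3 = mont_mul(x2, x_m, p, k)            # x^3
    ax = mont_mul(a_m, x_m, p, k)           # a * x
    rhs = (x3 + ax + b_m) % p               # two modular additions
    return lhs == rhs


P, K = 23, 5                                # toy field GF(23), R = 2^5


def to_mont(v):
    return (v << K) % P                     # v * R mod p


# (3, 10) lies on y^2 = x^3 + x + 1 over GF(23); (3, 11) does not.
on_curve = is_on_curve_mont(to_mont(3), to_mont(10), to_mont(1), to_mont(1), P, K)
off_curve = is_on_curve_mont(to_mont(3), to_mont(11), to_mont(1), to_mont(1), P, K)
```

Since multiplication by R is a bijection modulo p, comparing both sides in the Montgomery Domain gives the same verdict as comparing them in the ordinary representation.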

5.3.5. GF(p) EC Point Doubling

The prime field EC Point Doubling steps perform a single Point Doubling operation of a Point with montgomerized Jacobi coordinates, resulting in also represented in montgomerized Jacobi coordinates. The original algorithm for Point Doubling with Jacobi coordinate representation has been modified to be suitable for the proposed core and is given in Algorithm 7.

Requires: (, , , , , , , , )
Calculates: (, , )
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
Provides: (, , )

A core prime field EC Point Doubling operation requires , , , , , and .
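For orientation, the textbook Jacobian doubling formulas can be modelled on montgomerized operands as shown below. This is a sketch of the standard formulas, not the modified operation list of Algorithm 7; the radix R = 2^k, the constant set (2, 3, 4, 8), and all function names are assumptions.

```python
def mont_mul(a, b, p, k):
    """Functional model of a Montgomery Multiplication: a * b * R^-1 mod p, R = 2**k."""
    return (a * b * pow(1 << k, -1, p)) % p


def point_double_mont(X, Y, Z, a_m, p, k):
    """Textbook Jacobian doubling; all inputs and outputs are montgomerized."""
    def mm(u, v):
        return mont_mul(u, v, p, k)

    c = {n: (n << k) % p for n in (2, 3, 4, 8)}      # montgomerized constants
    y2 = mm(Y, Y)
    s = mm(c[4], mm(X, y2))                          # S = 4*X*Y^2
    z2 = mm(Z, Z)
    m = (mm(c[3], mm(X, X)) + mm(a_m, mm(z2, z2))) % p   # M = 3*X^2 + a*Z^4
    x3 = (mm(m, m) - mm(c[2], s)) % p                # X' = M^2 - 2*S
    y3 = (mm(m, (s - x3) % p) - mm(c[8], mm(y2, y2))) % p  # Y' = M*(S - X') - 8*Y^4
    z3 = mm(c[2], mm(Y, Z))                          # Z' = 2*Y*Z
    return x3, y3, z3


P, K = 23, 5                                         # toy field GF(23), R = 2^5


def to_mont(v):
    return (v << K) % P


# Double the point (3, 10) on y^2 = x^3 + x + 1 over GF(23).
X3, Y3, Z3 = point_double_mont(to_mont(3), to_mont(10), to_mont(1), to_mont(1), P, K)
```

The multiplications by 2, 3, 4, and 8 are the reason the constants have to be available in montgomerized form before the first doubling.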

5.3.6. GF(p) EC Point Addition

The prime field EC Point Addition steps perform a single Point Addition operation of two Points and with montgomerized Jacobi coordinates, resulting in also represented in montgomerized Jacobi coordinates. The original algorithm for Point Addition with Jacobi coordinate representation has been modified to be suitable for the proposed core and is given in Algorithm 8.

Requires: (, , , , , , , )
Calculates: (, , )
;
;
;
;
;
;
;
;
;
;
if    then
  then
Notify:
Point_at_Infinity ();
else
Perform: PointDoubling Operation;
else
;
;
;
;
;
;
;
;
;
;
;
;
;
Provides: (, , )

A core prime field EC Point Addition operation requires , , , , and .
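The structure of the addition, including the two special cases, can be sketched with the textbook Jacobian formulas. The code below is an illustrative model and not the operation list of Algorithm 8; it assumes R = 2^k, does not handle inputs that are already the point at infinity, and reports the special cases through a hypothetical status string.

```python
def mont_mul(a, b, p, k):
    """Functional model of a Montgomery Multiplication: a * b * R^-1 mod p, R = 2**k."""
    return (a * b * pow(1 << k, -1, p)) % p


def point_add_mont(p1, p2, p, k):
    """Textbook Jacobian addition on montgomerized points; returns (status, point)."""
    def mm(u, v):
        return mont_mul(u, v, p, k)

    (x1, y1, z1), (x2, y2, z2) = p1, p2
    z1z1, z2z2 = mm(z1, z1), mm(z2, z2)
    u1, u2 = mm(x1, z2z2), mm(x2, z1z1)
    s1, s2 = mm(y1, mm(z2, z2z2)), mm(y2, mm(z1, z1z1))
    h, r = (u2 - u1) % p, (s2 - s1) % p
    if h == 0:
        # Same x coordinate: either P2 == -P1 or P2 == P1.
        return ("double", None) if r == 0 else ("infinity", None)
    h2 = mm(h, h)
    h3 = mm(h2, h)
    v = mm(u1, h2)
    x3 = (mm(r, r) - h3 - 2 * v) % p        # 2*v stands for a modular addition v + v
    y3 = (mm(r, (v - x3) % p) - mm(s1, h3)) % p
    z3 = mm(h, mm(z1, z2))
    return "ok", (x3, y3, z3)


P, K = 23, 5                                # toy field GF(23), R = 2^5


def lift(x, y):
    """Montgomerize an affine point and extend it to Jacobian coordinates."""
    return (x << K) % P, (y << K) % P, (1 << K) % P


# (3, 10) + (9, 7) on y^2 = x^3 + x + 1 over GF(23).
status, R3 = point_add_mont(lift(3, 10), lift(9, 7), P, K)
```

When the status is "double", the caller would dispatch to a Point Doubling operation instead, mirroring the branch in the listing above.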

5.3.7. GF(p) EC Jacobi-to-Affine Transformation

The prime field EC Jacobi-to-Affine Transformation steps are responsible for the transformation of the supplied montgomerized Jacobi coordinates , , and of the curve Point back into affine coordinate representation. This transformation step requires the calculation of a modular multiplicative inverse element which will be performed by a Montgomery modular exponentiation according to Euler’s theorem since the modulus is a prime number. Algorithm 9 lists the necessary steps to utilize the core for prime field EC Jacobi-to-Affine Transformation.

Requires: (, , , , )
Calculates: (, )
;
;
;
;
;
Provides: (, )

A core prime field EC Jacobi-to-Affine Transformation operation requires , , , , and .
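The inversion by exponentiation can be illustrated as follows. This is a minimal sketch assuming R = 2^k and the exponent p − 2, so that Z^(p−2) ≡ Z^(−1) mod p for a prime p; `mont_exp` is a plain square-and-multiply model and all names are hypothetical.

```python
def mont_mul(a, b, p, k):
    """Functional model of a Montgomery Multiplication: a * b * R^-1 mod p, R = 2**k."""
    return (a * b * pow(1 << k, -1, p)) % p


def mont_exp(base_m, e, p, k):
    """Left-to-right square-and-multiply on a montgomerized base; result stays montgomerized."""
    acc = (1 << k) % p                      # montgomerized 1
    for bit in bin(e)[2:]:
        acc = mont_mul(acc, acc, p, k)
        if bit == "1":
            acc = mont_mul(acc, base_m, p, k)
    return acc


def jacobi_to_affine_mont(X, Y, Z, p, k):
    """Return montgomerized affine (x, y) = (X / Z^2, Y / Z^3)."""
    zi = mont_exp(Z, p - 2, p, k)           # Z^-1 via exponentiation, p prime
    zi2 = mont_mul(zi, zi, p, k)            # Z^-2
    zi3 = mont_mul(zi2, zi, p, k)           # Z^-3
    return mont_mul(X, zi2, p, k), mont_mul(Y, zi3, p, k)


P, K = 23, 5                                # toy field GF(23), R = 2^5
# Jacobian point (17, 21, 20) corresponds to the affine point (7, 12) over GF(23).
x_m, y_m = jacobi_to_affine_mont(*[(v << K) % P for v in (17, 21, 20)], P, K)
```

The exponentiation dominates the cost of this step, which is why the conversion is normally performed only once, after the last point operation.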

5.3.8. GF(p) EC Montgomery Backtransformation

The prime field EC Montgomery Backtransformation steps are responsible for the transformation of the supplied montgomerized point coordinates and of a Point out of the Montgomery Domain. Algorithm 10 lists the necessary steps to utilize the core for prime field EC Montgomery Backtransformation for an arbitrary curve Point .

Requires: (, )
Calculates: (, )
;
;
Provides: (, )

A core prime field EC Montgomery Backtransformation operation requires and .
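Leaving the Montgomery Domain corresponds to a Montgomery Multiplication with the operand 1, that is, a single Montgomery reduction. The sketch below models this with a textbook REDC step, assuming R = 2^k and an odd modulus p < R; it is not the word-based procedure of the core, and the function names are hypothetical.

```python
def redc(t, p, k):
    """Montgomery reduction: t * R^-1 mod p for 0 <= t < p * R, with R = 2**k and p odd."""
    r = 1 << k
    n_prime = (-pow(p, -1, r)) % r          # -p^-1 mod R
    m = ((t & (r - 1)) * n_prime) & (r - 1)
    u = (t + m * p) >> k                    # exact division by R
    return u - p if u >= p else u           # final conditional subtraction


def from_mont(v_m, p, k):
    """Backtransformation: Montgomery Multiplication of v_m with the operand 1."""
    return redc(v_m * 1, p, k)


# Round trip for every element of GF(23) with R = 2^5.
recovered = [from_mont((v << 5) % 23, 23, 5) for v in range(23)]
```

The round trip returns every field element unchanged, which is a convenient self-test for a Montgomery Transformation and Backtransformation pair.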

6. Performance Analysis

In this section, parameter-dependent formulas for the computation times in clock cycles of the described basic core operations are given, which allow upper and lower calculation bounds to be specified. Furthermore, for the supported precision widths in both finite fields, the number of words to be processed and the possible numbers of processing units are given. In order to estimate the size ratio of different core variations, the numbers of logic elements and dedicated logic registers for exemplary Altera and Xilinx FPGAs are stated. In addition, the results of a power estimation are given. Based on the resulting clock cycle times of the core variations, a reference implementation exhibiting a balance of performance and area consumption has been defined. For this reference implementation, the computation times in clock cycles for the described exemplary cryptographic algorithms are given.

6.1. Core Computation Time Formulas

Table 3 lists the computation time formulas, in clock cycles, for the RAM copy operations of the proposed core. Note that the resulting calculation times of RAM reorganisation operations depend only on the specified precision ( for and for ), the word width parameter for which the core variation has been generated and the resulting RAM width parameter with . The operations , , and exhibit the same computation time, whereas the operation is performed in fewer clock cycles.

The computing time formulas of prime field core operations given in clock cycles are listed in Table 4.

The computation time of the operation in depends on the specified precision , the number of active processing units as well as the number of words running through the pipeline. In order to specify lower and upper computation times a best case and worst case formula is given. In the best case the carry-out bit of the CLA adder after reuniting and words is not set, the comparator only has to evaluate the most significant word, and a modular subtraction is not necessary. In the worst case the carry-out bit of the CLA adder is also not set but the comparator has to evaluate all words and a reduction of the resulting value is necessary.

The operation in only depends on the chosen precision and specified word width parameters.

For the operation computation time a best case and worst case formula is given. In the best case, after the shift operation, the comparator will only evaluate one word and an initial modular subtraction operation is not necessary. For the involved Montgomery Multiplication operations the best case formula is used. In the worst case, after the shift operation the comparator has to evaluate all words and decide that an initial modular subtraction operation is needed. For the involved Montgomery Multiplication operations the worst case formula is used. The amount of and of operations depends on the chosen precision. Table 5 lists the values for all supported precisions.

For the operation, computation time in a best case and worst case formula is given. In the best case the exponent operand is ; therefore only two Montgomery Multiplications and one operation is necessary. For the involved Montgomery Multiplication operations the best case formula is used. In the worst case the exponent is assumed to be ; therefore , and operations have to be performed. For the involved Montgomery Multiplication operations the worst case formula is used.

For the operation computation time a best case and worst case formula is given. In the best case, after the modular addition the CLA adder carry-out bit will not be set, the comparator will only have to evaluate one word and an additional modular subtraction is not needed. In the worst case the CLA adder carry-out bit will also not be set, but the comparator will have to evaluate all words to decide that an additional modular subtraction is necessary.

For the operation computation time a best case, worst case, and absolute worst case formula is given. In the best case, after the modular subtraction the CLA adder carry-out bit will not be set, the comparator will only have to evaluate one word and an additional modular subtraction is not needed. In the worst case, after the modular subtraction the CLA adder carry-out bit will be set and a modular addition must be performed. In the absolute worst case after the modular subtraction the CLA adder carry-out bit will not be set, the comparator will evaluate all words, and an additional modular subtraction step is necessary. Note that this will only occur if the resulting value after the first subtraction operation will be identical to the modulus, which under normal operation conditions will not be the case.

The prime field operation is identical to the operation; therefore the same best and worst case formulas apply.

The computing time formulas of binary field core operations given in clock cycles are listed in Table 6.

The computation time of the operation in depends on the specified precision parameter , the number of active processing units , and the number of words running through the pipeline. Since the additions in are simple XOR-operations and the most significant bit of the resulting value will never be set after calculation only one formula is given.

While the determination of the Montgomery Parameter in differs from the calculation rule for it also only depends on the chosen precision and specified word width parameters.

The operation in is based on shifts and possible modular additions whenever the most significant bit of the intermediate value will be set after a shift. The amount of modular additions depends on the Montgomery Parameter which itself depends on the irreducible polynomial. In order to specify lower and upper computation times a best case and worst case formula is given. The best case assumes that no modular addition operation is required at all, whereas the worst case assumes that a modular addition operation is required after each shift operation.
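The shift-and-add behaviour described above can be modelled in a few lines. The sketch assumes that the operation computes R² mod f(x) with R = x^m for an irreducible polynomial f of degree m, starting from R mod f(x); this interpretation, the bit-vector encoding of polynomials as integers, and the function name are assumptions made for illustration.

```python
def r2_mod_poly(f, m):
    """Compute x^(2m) mod f over GF(2) by m shifts with conditional XOR reductions.

    f is the irreducible polynomial of degree m encoded as an integer bit vector.
    Returns the reduced value and the number of XOR reductions that were needed.
    """
    value = f ^ (1 << m)                    # x^m mod f: drop the leading term of f
    reductions = 0
    for _ in range(m):
        value <<= 1                         # multiply by x
        if (value >> m) & 1:                # degree reached m: reduce
            value ^= f                      # modular addition in GF(2^m) is XOR
            reductions += 1
    return value, reductions


# f(x) = x^3 + x + 1: x^6 mod f = x^2 + 1, reached with two reductions.
value, xors = r2_mod_poly(0b1011, 3)
```

The best case of the formula corresponds to zero reductions and the worst case to one reduction per shift, that is, m reductions in this model.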

For the operation computation time a best case and worst case formula is given. In the best case the exponent operand is therefore only two Montgomery Multiplications and one operation is necessary. In the worst case the exponent is assumed to be therefore , and operations are required.

For the operation computation time a best case and absolute worst case formula is given. In the best case the comparator will only have to evaluate one word. In the absolute worst case the comparator will have to evaluate all words to decide that an additional modular addition is necessary. Note that this will only occur if the resulting value after the first addition operation will be identical to the modulus polynomial, which under normal operation conditions will not be the case.

The binary field operation is identical to the operation; therefore the same formula applies.

6.2. Core Variations

Depending on the needs, in terms of performance, area consumption, supported precisions, and the interfacing structure, different variations of the core can be generated by defining the parameters MAX_PRECISION_WIDTH, WORD_WIDTH and MAX_NUM_PUS. Table 7 lists the resulting number of words and the possible number of processing units for the supported prime field precisions and typical word widths of , and bit.

In contrast, Table 8 lists the resulting number of words and the number of possible processing units for the supported binary field precisions and typical word widths of , , and bits. Note that the number of possible processing units for binary fields within the defined core is subjected to a further constraint. Once all bits of operand have been processed the remaining processing units in the pipeline must be bypassed and the and words must be directly fed into the CLA adder. Since the result of the CLA adder will be written back to RAM but remaining words must still be read from RAM and fed into the first processing unit, the RAM source and destination signals must never address the same memory location at one time. Therefore the equation must hold true to , meaning that no processing unit will be bypassed, or , meaning that the very last processing unit will be bypassed at the last cycle of operand bits.

6.3. Core Hardware Footprint

Since all components of the design consist of simple logic elements, the proposed arithmetic core is vendor-neutral. In order to estimate the hardware footprint of different core implementations the design variations have been compiled on Altera and Xilinx FPGAs. Table 9 lists the amount of total logic elements and comprised logic registers for varied values of WORD_WIDTH () and MAX_NUM_PUS generated for an Altera Cyclone IV (EP4CE115F29C9L) device featuring logic elements and memory bits.

Table 10 lists the amount of total logic elements and comprised logic registers for varied values of WORD_WIDTH () and MAX_NUM_PUS generated for a Xilinx XC7Z020 (xc7z020clg484-1) device featuring logic elements and registers. The resulting values demonstrate that the design can compete with other proposed designs, for instance, the one compared in [14, 15]. Furthermore, instead of being restricted to a single cryptographic application, the core can handle various algorithms. Depending on the area requirements, a suitable variation can be chosen for a specific implementation. This choice will have an impact on power consumption and computing time.

6.4. Core Power Estimation

In order to evaluate the suitability of the proposed core for the application in the IoT area, a power estimation has been conducted using two common frequencies of 100 MHz and 200 MHz for various core variations. Timing analysis yields that the design can reliably be operated with these frequencies. The power consumption characteristics have been derived by applying the PowerPlay Power Analyzer Tool of the Quartus Prime IDE to the final design using default settings of a power toggle rate as well as a power input I/O toggle rate of , using a vectorless estimation and a board temperature of 25°C. Table 11 lists the Total Thermal Power Dissipation values for varied WORD_WIDTH () and MAX_NUM_PUS parameters generated for the Altera Cyclone IV (EP4CE115F29C9L) device. The values are comparable to the ones given in [24] for RSA calculation.

Furthermore, it should be noted that the optimization mode in the compiler settings was set to balanced and that no power-specific compiler optimizations were enabled. The results indicate that the core is suitable for applications with tight power consumption constraints. Depending on such constraints and on the desired clock frequency, a suitable variation can be implemented. This choice will have an impact on computing time and hardware footprint.

6.5. Core Reference Implementation

For the reference implementation a word width of WORD_WIDTH = 32 bit was chosen and the maximum number of processing units of the pipeline was set to MAX_NUM_PUS = 32. The maximum supported precision width parameter MAX_PRECISION_WIDTH was set to leading to a RAM consisting of bits. Table 12 lists the computation time in clock cycles of the reference implementation for RSA application. For RSA public-key operations best case and worst case computation times are given under the assumption that the public exponent is . Therefore during the operation a total of , and operations will be performed. Since the private exponent is different for varied RSA keys only worst case computation times for the supported precision widths are given. The worst case RSA private key and CRT-accelerated private key computation times assume the worst case clock cycle times of the underlying operations given in previous section.

Table 13 lists the worst case computation times in clock cycles of the reference implementation for one iteration of the Miller-Rabin prime test. Note that the most time-consuming operation is part one of the outer loop of Algorithm 2, which is always performed in each iteration. Depending on the evaluation of the result, it might be necessary to execute part two of the outer loop. Furthermore, depending on the structure of the prime candidate, it might be necessary to execute parts one and two of the inner loop multiple times.

Table 14 lists the computation time in clock cycles of the reference implementation for prime field EC operations for all supported precision widths. The Affine-to-Jacobi Transformation step requires a precision dependent number of clock cycles. For the remaining steps worst case clock cycle times are given. For the Point Multiplication operation an absolute worst case computation time is stated in which a theoretical scalar is hypothesized to be , therefore a maximum of Point Doubling and Point Addition operations would be necessary assuming a simple double and add algorithm.
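How the scalar determines the number of point operations in a simple double and add algorithm can be sketched as follows. The model is generic over the group operations and counts the calls; it assumes a left-to-right variant in which the most significant scalar bit is consumed by initialising the accumulator with the point, so the counts may differ by one from the convention used for Table 14. The function name is hypothetical.

```python
def double_and_add(scalar, point, double, add):
    """Left-to-right double and add; returns (result, doublings, additions)."""
    if scalar < 1:
        raise ValueError("scalar must be a positive integer")
    acc = point                             # consumes the leading 1 bit of the scalar
    doublings = additions = 0
    for bit in bin(scalar)[3:]:             # remaining bits after the leading 1
        acc = double(acc)
        doublings += 1
        if bit == "1":
            acc = add(acc, point)
            additions += 1
    return acc, doublings, additions


# Integers under addition serve as a stand-in group: k * P is ordinary multiplication.
worst = double_and_add(0xFF, 5, lambda a: 2 * a, lambda a, b: a + b)
best = double_and_add(0x80, 5, lambda a: 2 * a, lambda a, b: a + b)
```

A scalar with all bits set triggers an addition after every doubling, which is the situation the absolute worst case figure describes.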

7. Conclusion and Future Work

A comprehensive, adaptable hardware structure for efficient arithmetic operations in prime and binary finite fields has been proposed. It extends the capabilities of hardware designs that provide only a Montgomery Multiplier and allows cryptographic calculations for a wide range of algorithms to be carried out, all based on the same arithmetic unit operations with arbitrary parameters. The approach taken by the proposed core is to combine standard modular addition and subtraction support with the capability of performing Montgomery Multiplications, full Montgomery Exponentiations, and the calculation of the Montgomery Parameters for arbitrary moduli, thereby bringing together all arithmetic operations required for a wide range of the cryptographic algorithms used today. By decomposing these algorithms, individual operation lists have been derived for the arithmetic unit, rendering extra precomputations in software unnecessary.

The given values of possible hardware footprint and power consumption for specific core variations allow choosing the proper configuration for a specific implementation. The reference implementation showed that with an internal RAM of merely 3.5 kB the core is capable of performing complete prime field and binary field EC operations for various precision widths of standardised curves. Furthermore the same core configuration is capable of performing (CRT-accelerated) RSA operations for typical precision widths required today, (safe) prime testing/generation, and Diffie-Hellman key exchange operations up to bit precision widths. The design should further be optimized in terms of power consumption.

However, the way some core operations are implemented, such as the Montgomery Multiplication and especially the Montgomery Exponentiation, necessitates additional security considerations, since the calculation times depend on the structure of the processed operands. This makes the design prone to side-channel attacks if security-sensitive information, such as private keys, is processed. Not all operations are critical in this respect; the calculation of the Montgomery Parameters, for example, does not need to be secured. Therefore, at the time of writing, the core is being enhanced to provide a secure calculation bit within the command input word which, if set, instructs the core to perform the specified arithmetic operation in a time-invariant fashion. In addition, special care has to be taken when defining core operation lists, for instance, for performing elliptic curve Point Multiplication operations. Algorithms that execute in a fixed amount of time, e.g., the Montgomery ladder [25], must be chosen to mitigate the risk of timing and power analysis attacks.
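The regular operation schedule of the Montgomery ladder can be sketched as follows. The model is generic over the group operations and performs exactly one addition and one doubling per scalar bit, independent of the bit value. It only illustrates the schedule: the Python branch itself is not constant-time, and the function name, the explicit `identity` argument, and the fixed bit length `n_bits` are assumptions made for the example.

```python
def montgomery_ladder(scalar, point, identity, add, double, n_bits):
    """Compute scalar * point with one add and one double per processed bit."""
    r0, r1 = identity, point                # invariant: r1 == r0 + point
    for i in reversed(range(n_bits)):
        if (scalar >> i) & 1:
            r0, r1 = add(r0, r1), double(r1)
        else:
            r0, r1 = double(r0), add(r0, r1)
    return r0


# Integers under addition serve as a stand-in group: 13 * 5 over an 8-bit scalar.
result = montgomery_ladder(13, 5, 0, lambda a, b: a + b, lambda a: 2 * a, 8)
```

Because the number and order of group operations depend only on `n_bits`, an operation list derived from this schedule has a fixed length for a given precision width.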

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.