Research Article  Open Access
Martin Schramm, Reiner Dojen, Michael Heigl, "A Vendor-Neutral Unified Core for Cryptographic Operations in GF(p) and GF(2^m) Based on Montgomery Arithmetic", Security and Communication Networks, vol. 2018, Article ID 4983404, 18 pages, 2018. https://doi.org/10.1155/2018/4983404
A Vendor-Neutral Unified Core for Cryptographic Operations in GF(p) and GF(2^m) Based on Montgomery Arithmetic
Abstract
In the emerging IoT ecosystem, in which internetworking will reach an entirely new dimension, efficient security solutions for embedded devices will play a crucial role. IoT-enabled devices are typically equipped with integrated circuits, such as ASICs or FPGAs, to perform highly specific tasks. Such devices must implement cryptographic layers and must be able to access cryptographic functions for encrypting/decrypting and signing/verifying data using various algorithms, and to generate true random numbers, random primes, and cryptographic keys. Because typical IoT devices have limited resources, owing to energy efficiency requirements, hardware structures that are efficient in terms of time, area, and power consumption must be deployed. In this paper, we describe a scalable, word-based, multi-vendor-capable cryptographic core that performs arithmetic operations in prime and binary extension finite fields based on Montgomery Arithmetic. The functional range comprises the calculation of modular additions and subtractions, the determination of the Montgomery Parameters, and the execution of Montgomery Multiplications and Montgomery Exponentiations. A prototype implementation of the adaptable arithmetic core is detailed. Furthermore, the decomposition of cryptographic algorithms to be used together with the proposed core is described and a performance analysis is given.
1. Introduction
The next generation of embedded systems and IoT devices will exhibit a much higher degree of internetworking, which gives rise to security considerations [1]. As a logical consequence, such devices must become cryptographic nodes that are capable, among other things, of encrypting/decrypting and signing/verifying data as well as establishing spontaneous secured communications by exchanging common secrets used for secret key calculation. While many embedded chips already support hardware-accelerated symmetric algorithms (mainly AES) [2] and hash functions, for reasons such as complexity, space, and cost they lack hardware support for a wide range of public-key and key exchange algorithms with different precision widths. Besides, many modern cryptographic primitives require the capability to produce true random numbers and random prime numbers. Typical IoT devices furthermore often have only limited resources, which requires cryptographic hardware structures that are efficient in terms of area, power consumption, and calculation performance [3]. In general, enterprises developing IoT products have three options for including application functionality in highly integrated devices: Application Specific Standard Products (ASSP), Application Specific Integrated Circuits (ASIC), or Field Programmable Gate Arrays (FPGA). Today FPGAs have become promising components for IoT applications [4]: ASSP solutions often cannot provide the required functionality, and FPGAs can provide a better Total Cost of Ownership (TCO) than ASIC solutions. Thus, for devices which are equipped with an FPGA, it is valuable to examine how efficient hardware structures for performing cryptographic operations can be included.
In matters of algorithm agility, an arithmetic engine with minimal hardware footprint that can handle the arithmetic operations of a great variety of cryptographic algorithms is of great importance for IoT-based devices. In particular, the calculability of the individual operations, which yields lower and upper bounds on the calculation time, is important.
This paper proposes a compact, vendor-neutral cryptographic arithmetic core, implemented by way of example in FPGA logic. For efficiency, time-intensive modular operations such as multiplication and exponentiation are performed using Montgomery Arithmetic. Without the need for any expensive software precalculations, the core is able to perform a large number of cryptographic algorithms and handle various key sizes by simply processing operation lists. Furthermore, the core architecture is unified and can perform calculations in both prime finite fields (GF(p)) and binary extension fields (GF(2^m)). To illustrate the versatility of the developed core, well-established cryptographic algorithms have been rewritten and fragmented into operation lists to be processed by the arithmetic engine.
The paper is organized as follows. Section 2 states the related work of this research. In Section 3 the design of the proposed Enhanced Montgomery Multiplication Core is stated; the specified functional range of the core is given in Section 4. In Section 5 some exemplary application descriptions for the core are mentioned and in Section 6 the results of the performance analysis are stated. Finally, Section 7 concludes the paper.
2. Related Works
The efficiency of cryptographic algorithms when implemented on reconfigurable hardware is mainly determined by how the underlying finite field arithmetic operations are realized [5]. Several applications in cryptography, such as ciphering and deciphering with asymmetric algorithms, the creation and verification of digital signatures, and secure key exchange mechanisms, require extensive use of the basic finite field modular arithmetic operations: addition, multiplication, and the calculation of the multiplicative inverse. The field multiplication operation in particular is crucial to the efficiency of a design, since it is the core operation of many cryptographic algorithms [6].
In [7] P. L. Montgomery introduced a representation of residue classes in order to speed up modular multiplications without affecting modular additions and subtractions. Over the years numerous designs have been proposed implementing modular multiplications based on Montgomery’s multiplication algorithm [8]. The foundation for these architectures was presented by A. Tenca and Ç. Koç in [9]. The architecture is based on a word-based Montgomery Multiplication algorithm for prime finite fields in which multiplications are performed in a bit-serial fashion. E. Savaş et al. in [10] have proposed an extension which, in addition to the standard integer modulo arithmetic, also allows polynomial computations over binary finite fields. An overview of algorithms and hardware architectures for Montgomery Multiplication can be found in [11]. Optimizations of the original design have been proposed concerning the hardware implementation of the Montgomery Multiplication algorithm [12] as well as by utilizing special arithmetic hard-core extensions of FPGAs to accelerate digital signal processing applications [13]. Some designs only focus on utilizing the Montgomery Multiplication method to accelerate modular exponentiation operations as required by the RSA algorithm [14, 15].
However, no publication focuses on how the Montgomery Multiplication architecture can be embedded into a comprehensive solution. In this paper we propose an enhanced version of a bit-serial, word-based unified Montgomery Multiplication core based on logic elements only, which is controlled by a state machine and offers the functional range needed to perform complete cryptographic algorithms without additional complex processing in software.
3. Enhanced Montgomery Multiplication Core
3.1. Requirements
Today a large number of different public-key algorithms are in use. To ensure compatibility, cryptographic applications must support a large portion of those algorithms. While typical software implementations can often easily be upgraded to adopt new algorithms and larger key sizes, the same is not necessarily true for hardware implementations. Therefore the following requirements have been identified for the Enhanced Montgomery Multiplication Core:
(i) Use of Montgomery Arithmetic. The design must be able to perform modulo operations in a time-efficient manner by using Montgomery Arithmetic. At a minimum, the core must support Montgomery Multiplications and Montgomery Exponentiations. Furthermore the core must support standard modulo additions and modulo subtractions.
(ii) Works on Both Finite Fields GF(p) and GF(2^m). The architecture must exhibit a unified structure supporting both standard integer modulo operations of prime finite fields and polynomial calculations of binary finite fields.
(iii) Montgomery Parameter Calculation. In general the Montgomery Parameters can be precomputed for previously known moduli. However, as a requirement the core must be able to handle arbitrary moduli. Therefore it must be capable of calculating the Montgomery Parameters without the need for precalculations done in software.
(iv) Scalable Design. The architecture must be scalable in terms of timing, area, and power consumption. This includes the parametrisation of the word width, the internal storage size, and the number of processing units within the pipeline.
(v) Multi-algorithm Support. The core must be based on a building-block design. The functional range provided by the arithmetic unit should enable algorithm agility by fragmenting cryptographic algorithms into a list of core operations. At a minimum, the core must be capable of performing RSA [16] operations, (safe) prime number generation and primality testing (MR) [17, 18], key exchange operations (DH) [19], and elliptic curve calculations (EC) [20] over both prime and binary finite fields.
(vi) Supporting as Many Precision Widths as Possible. The design must support a wide range of different precision widths determining the security level of the cryptographic algorithm. If a certain security level becomes inadequate due to increased attacking computing power, the precision width can be adjusted accordingly, which makes the hardware less prone to becoming obsolete due to higher security demands. The core must support the current recommendations for minimum key sizes [21] and should also support larger key sizes. For RSA algorithm and Diffie-Hellman key exchange support the architecture should be able to handle precisions up to bit moduli, for elliptic curve cryptography support precisions up to bits for prime finite fields and precisions up to bits for binary finite fields should be possible.
(vii) Time-Invariant Operations. The architecture must be capable of performing its operations in a time-invariant manner. If security-sensitive information, such as private keys, is processed, it must be ensured that all operations exhibit the same execution time to prevent side-channel attacks based on timing analysis.
3.2. Overall Core Architecture
Figure 1 illustrates the overall architecture of the proposed Enhanced Montgomery Multiplication Core which is capable of meeting all requirements as specified above.
Besides the pipeline of processing units handling the main part of the word-based Montgomery Multiplication algorithm, the core features an enhanced word-based Carry Look-Ahead adder, which is responsible for calculating the final result after the pipeline has processed all bits of an operand as well as for performing single modular addition and subtraction operations. The register files of the original design have been replaced with an internal dual-ported RAM which holds the operands as well as intermediate results of the core operations. Furthermore a word-based comparator component has been described which is queried during operations to decide if a modular addition or subtraction step must be performed. Two additional words of RAM width, one for the operand and one for the exponent, have been introduced; they are fetched from RAM in case of Montgomery Multiplication and Montgomery Exponentiation operations. An auxiliary word is used for RAM reorganisation operations as well as for the calculation of the Montgomery Parameters.
The intelligence of the core is the controlling state machine, which utilizes the defined components to perform standard modular addition and subtraction operations, Montgomery Multiplications, Montgomery Exponentiations, Montgomery Parameter calculation, and RAM reorganisation operations. It is therefore responsible for controlling the RAM write and read access, the source and destination address signals of the RAM, as well as the values passed through to the first processing unit, to the CLA adder, and to the comparator component. Furthermore it controls the assignments of the operand, exponent, and auxiliary words.
The described core can be parametrised in three ways. The parameter named MAX_PRECISION_WIDTH specifies the highest supported precision width, whereas the parameter WORD_WIDTH is used to specify the word width of the operands involved in the calculations. These two parameters determine the size and the address space of the internal core RAM. The third parameter, MAX_NUM_PUS, specifies the maximum number of processing units of the pipeline implemented for a specific core variation, mainly affecting the performance and the size in terms of area consumption.
3.2.1. Processing Units
The heart of the core is the pipeline of processing units implementing the multiple word version of the Montgomery Multiplication algorithm. Therefore the processing unit structure has been described from scratch. The processing unit can be held in reset and keeps track of the cycle number according to the number of words to be processed depending on the supplied parameters. This control logic is needed to determine whether the supplied modulus has to be added to the processed words in this cycle or not, depending on the value of the signal denoting an odd intermediate result. Note that buffering the output of a processing unit between two processing units is not required in this design. Compared to the original design presented in [10] for a given precision width and a word size , number of words are required for a unified solution and the pipeline must consist of a power of two () number of processing units with a maximum number of in order to avoid pipeline stalls. Figure 2 illustrates the internal architecture of an exemplary processing unit with word size .
Each processing unit consists of a cascade of two layers of so-called Unified Full Adder (UFA) cells. The Unified Full Adder cells basically consist of simple full adder cells which have been enhanced by an additional finite field selection input. This allows for the creation of a unified multiplier architecture which can be used not only in prime fields (GF(p)) but also in binary fields (GF(2^m)), in which additions are simple bitwise XOR calculations without any carry output.
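The behaviour of a single UFA cell can be sketched in a few lines of Python. This is a minimal bit-level model, not the hardware description of the core: the function name `ufa` and the polarity of the field selection input (`fsel = 1` for GF(p), `fsel = 0` for GF(2^m)) are assumptions made for illustration.

```python
def ufa(a: int, b: int, cin: int, fsel: int) -> tuple:
    """Bit-level model of a unified full adder cell.

    fsel = 1 (assumed polarity): ordinary full adder for GF(p).
    fsel = 0: carry-free addition for GF(2^m), i.e. a plain XOR.
    """
    s = a ^ b ^ cin                             # sum bit is the same in both fields
    majority = (a & b) | (a & cin) | (b & cin)  # carry of an ordinary full adder
    cout = majority & fsel                      # carry is suppressed in GF(2^m)
    return s, cout
```

With `fsel = 1` the cell satisfies `s + 2 * cout == a + b + cin`; with `fsel = 0` the carry output is always zero, which is what turns the adder array into a polynomial (XOR) adder.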
3.2.2. Carry Look-Ahead Adder
Since the pipeline generates the result in carry-save form, an additional step is necessary at the end of each calculation to obtain a non-redundant version of the result. For the sake of uniformity a circuit is required that can operate in both finite fields GF(p) and GF(2^m). Furthermore, since the calculation in GF(p) could require one further subtraction step, the Carry Look-Ahead adder in the design has been formulated to be able to perform word-based modular additions and subtractions. Figure 3 illustrates the logic of the proposed enhanced CLA adder of the core.
The internal signal of the second operand will be calculated as in which denotes an add-or-subtract signal ( means addition, represents subtraction by performing an addition in two’s complement representation). The modified CLA adder involves the same common Carry Look-Ahead adder logic for the calculation of the generate () and propagate () functions. The output values of the CLA adder logic will be calculated as for the least-significant bit and for all further bits. The final sum output bits will be calculated as ; the carry output bit will be determined as . If the selected finite field is GF(2^m), then the add-or-subtract input will be ignored, the final sum will simply be the bitwise modulo-2 addition of the two input values, and the carry output bit will be forced to zero.
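The word-based add/subtract behaviour described above can be modelled as follows. This is a sketch under stated assumptions: the names `cla_word` and `add_sub_words`, the default word width, and the signal polarities (`sub = 1` for subtraction, `fsel = 1` for GF(p)) are chosen for illustration only. The carry recurrence is evaluated sequentially here, whereas look-ahead hardware computes the same carries in parallel.

```python
def cla_word(a, b, cin, sub, fsel, w=32):
    """Add or subtract one w-bit word; returns (sum_word, carry_out)."""
    mask = (1 << w) - 1
    if fsel == 0:
        # GF(2^m): bitwise modulo-2 addition, add-or-subtract input ignored
        return (a ^ b) & mask, 0
    b_eff = b ^ (mask if sub else 0)   # invert second operand for subtraction
    g = a & b_eff                      # generate bits
    p = a ^ b_eff                      # propagate bits
    c, s = cin, 0
    for i in range(w):
        gi, pi = (g >> i) & 1, (p >> i) & 1
        s |= (pi ^ c) << i             # sum bit i
        c = gi | (pi & c)              # carry into bit i + 1
    return s, c


def add_sub_words(a_words, b_words, sub, fsel, w=32):
    """Chain cla_word over little-endian lists of w-bit words."""
    c = 1 if (sub and fsel) else 0     # the "+1" of the two's complement
    out = []
    for a, b in zip(a_words, b_words):
        s, c = cla_word(a, b, c, sub, fsel, w)
        out.append(s)
    return out, c
```

In this model a final carry of 0 after a GF(p) subtraction indicates a negative result, which is the condition under which a modular subtraction would add the modulus back.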
3.2.3. Core RAM Structure
The RAM of the core must be capable of holding all the necessary operands and intermediate values required during the execution of cryptographic algorithms. The basic structure of the described RAM is pictured in Figure 4.
It features four symbolic horizontal RAM operand locations with MAX_PRECISION_WIDTH bit each which are organized as eight pieces of MAX_PRECISION_WIDTH bit each. The location named is intended to hold operand in Montgomery Multiplication and Montgomery Exponentiation operations; the location named is intended to hold the modulus. The location usually holds the temporary sum value during Montgomery Multiplications and Montgomery Exponentiation or the first operand in modular addition or subtraction operations. The location usually holds the temporary carry stream during Montgomery Multiplications and Montgomery Exponentiation or the second operand in modular addition or subtraction operations.
Besides the horizontal RAM operand locations three symbolic vertical RAM operand locations with MAX_PRECISION_WIDTH bit each have been defined which are organized as eight pieces of MAX_PRECISION_WIDTH bit each. The locations named , , and for convenience usually are used to hold operand in Montgomery Multiplication and Montgomery Exponentiation operations as well as the exponent operand and the auxiliary operand in Montgomery Exponentiation operations. In addition all RAM slots are intended to hold intermediate values during the execution of cryptographic algorithms.
4. Functional Range of the Core
This section provides a description of the functional range of the proposed core. The following precisions (denoted in bit length) are supported:
(i) EC over GF(p), RSA, MR, DH: 192, 224, 256, 320, 384, 448, 512, 768, 1024, 1536, 2048, 3072, 4096
(ii) EC over GF(2^m): 131, 163, 176, 191, 193, 208, 233, 239, 272, 283, 304, 359, 368, 409, 431, 571
If further or other precision widths should be supported, the described core can easily be adjusted in an appropriate manner. For the parametrisation and the execution/abortion of an operation a bit wide command input word has been defined. Besides the start, abort, and finite field selection signals also the encoded precision width, operation code as well as RAM offsets for the specified operation can be supplied. The following operations have been specified.
4.1. MontMult Operation
The MontMult operation code instructs the core to perform a single Montgomery Multiplication with the supplied elements in the given finite field. A Montgomery Multiplication will start by reading the first bit word of operand from RAM. Afterwards the pipeline will be started and the appropriate bits of operand will be fed to the individual processing units. Once all bits of the operand word have been fed to the processing units, a new word will be read from RAM. Once the last bit of operand has been processed, the temporary sum and temporary carry words will be fed into the CLA adder in order to reunite the two streams. After the last words of temporary sum and temporary carry have been brought together, the carry output bit of the CLA adder will be evaluated. If the carry bit is set, the modulus will be subtracted once; otherwise the result will be compared to the given modulus. If the result is equal to or greater than the modulus, the given modulus will be subtracted once.
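The arithmetic performed by the MontMult operation can be illustrated with a bit-serial Python model. It is a sketch on plain integers that ignores the word-based pipeline, the carry-save streams, and the RAM layout; the function name `mont_mul` and its parameters (`k` for the precision bit length, `fsel = 1` for GF(p)) are assumptions.

```python
def mont_mul(a, b, n, k, fsel=1):
    """Return a * b * 2^(-k) mod n (GF(p)) or a * b * x^(-k) mod n (GF(2^m)).

    a, b: operands already reduced modulo n
    n:    odd modulus below 2^k, or irreducible polynomial of degree k
    """
    s = 0
    for i in range(k):                 # scan operand a bit by bit
        if (a >> i) & 1:
            s = s + b if fsel else s ^ b
        if s & 1:                      # odd intermediate result: add the modulus
            s = s + n if fsel else s ^ n
        s >>= 1                        # exact division by 2 (or by x)
    if fsel and s >= n:                # final correction step in GF(p)
        s -= n
    return s
```

Multiplying the returned value by 2^k (or x^k) and reducing modulo `n` recovers the ordinary product, which is why operands are kept in the Montgomery Domain across chained multiplications.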
4.2. MontR Operation
The MontR operation code instructs the core to calculate the Montgomery Parameter 2^k mod n for a supplied modulus n in the given finite field, with k being the bit length of the given precision.
In the case of prime field arithmetic the Montgomery Parameter will be 2^k mod n, so it can be calculated as the two’s complement of n, that is, as the bitwise inverse of the given modulus plus 1. Therefore the individual words of the modulus will be XORed with a constant word consisting of all-ones. In addition the least-significant bit of the first word will be set to one.
In the case of binary field arithmetic the Montgomery Parameter will be x^k mod f(x), with f(x) being the irreducible polynomial of degree k, so it is equal to the binary expression of the irreducible polynomial with the most significant bit set to zero. Therefore the individual words of the modulus will be scanned and the appropriate most significant bit will be set to zero, depending on the given precision.
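Both cases of the MontR calculation can be checked with a short Python sketch. The name `mont_r` and the parameters `k` (precision bit length) and `fsel` (1 for GF(p)) are assumptions. The prime-field branch assumes that the modulus is odd and occupies the full precision, i.e. that bit k-1 of `n` is set; otherwise the two's complement is not a fully reduced value.

```python
def mont_r(n, k, fsel=1):
    """Montgomery Parameter 2^k mod n (GF(p)) or x^k mod n (GF(2^m))."""
    if fsel:
        mask = (1 << k) - 1
        inv = n ^ mask                 # XOR every bit with one (bitwise inverse)
        return inv | 1                 # n is odd, so setting bit 0 equals adding 1
    return n & ~(1 << k)               # clear the top bit of the polynomial
```

For example, with the modulus 241 and an 8-bit precision the prime-field branch yields 256 - 241, and for the polynomial 0x11B of degree 8 the binary-field branch yields 0x1B.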
4.3. MontR2 Operation
The MontR2 operation code instructs the core to calculate the Montgomery Parameter R^2 mod n with R = 2^k for a supplied modulus n in the given finite field, with k being the bit length of the given precision.
In the case of prime field arithmetic the Montgomery Parameter will be given by 2^(2k) mod n. Therefore in a first step the Montgomery Parameter 2^k mod n will be calculated for prime fields as described above. In order to calculate R^2 mod n one possible way is to calculate with being a small divider of . In the given implementation . Therefore the bits of the parameter will be shifted to the left by one bit. If the result is equal to or greater than the modulus, n will be subtracted once. By using a square-and-multiply-like algorithm, multiple Montgomery Multiplications will be performed in order to calculate R^2 mod n.
In the case of binary field arithmetic the Montgomery Parameter will be given by x^(2k) mod f(x), with f(x) being the irreducible polynomial of degree k. Therefore in a first step the Montgomery Parameter x^k mod f(x) will be calculated for binary fields as described above. In order to calculate x^(2k) mod f(x) the resulting parameter will be shifted k times bitwise to the left. After each shift, the most significant bit as given by the precision parameter will be evaluated. If the bit is one, the irreducible polynomial will be added to the intermediate result, which represents a modulo reduction with f(x). Once the shift has been performed k times the result will be x^(2k) mod f(x).
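One way to read the two procedures is sketched below in Python. The names `mont_mul` and `mont_r2` and their parameters are assumptions, and so is the exact exponentiation schedule in the prime-field branch: the sketch doubles the first Montgomery Parameter once, which gives the montgomerized value of 2, and then raises it to the power k with Montgomery Multiplications. The implementation in the core may organise these steps differently.

```python
def mont_mul(a, b, n, k, fsel=1):
    """Bit-serial Montgomery Multiplication on plain integers."""
    s = 0
    for i in range(k):
        if (a >> i) & 1:
            s = s + b if fsel else s ^ b
        if s & 1:
            s = s + n if fsel else s ^ n
        s >>= 1
    return s - n if fsel and s >= n else s


def mont_r2(n, k, fsel=1):
    """Second Montgomery Parameter: 2^(2k) mod n or x^(2k) mod n."""
    if fsel:
        r = (1 << k) - n               # MontR result, assumes bit k-1 of n is set
        t = r << 1                     # shift left by one bit: 2 * R
        if t >= n:
            t -= n                     # t is now the montgomerized value of 2
        acc = t
        for bit in bin(k)[3:]:         # bits of exponent k, leading one skipped
            acc = mont_mul(acc, acc, n, k)
            if bit == "1":
                acc = mont_mul(acc, t, n, k)
        return acc                     # montgomerized 2^k, i.e. R^2 mod n
    r = n ^ (1 << k)                   # MontR result: x^k mod f(x)
    for _ in range(k):
        r <<= 1                        # multiply by x
        if (r >> k) & 1:
            r ^= n                     # reduce by the irreducible polynomial
    return r
```

The prime-field result can be compared against Python's built-in `pow(2, 2 * k, n)` as a reference.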
4.4. MontExp Operation
The MontExp operation code instructs the core to perform a Montgomery Exponentiation consisting of multiple Montgomery Multiplication steps in the given finite field. A Montgomery Exponentiation will start by reading the first bit word of exponent from RAM. Afterwards the first one bit of the exponent word will be searched for, starting from the most significant bit. If the first word consists of all-zeros, then the next word of exponent will be read and evaluated. Once the highest bit of exponent has been found, multiple Montgomery Multiplications will be performed following a square-and-multiply algorithm until all bits of the exponent have been processed.
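A left-to-right square-and-multiply loop over Montgomery Multiplications can be sketched as follows. The names `mont_mul` and `mont_exp` are assumptions, the base is expected to be in the Montgomery Domain already, and the sketch makes no attempt at time-invariant execution.

```python
def mont_mul(a, b, n, k):
    """Bit-serial Montgomery Multiplication in GF(p): a * b * 2^(-k) mod n."""
    s = 0
    for i in range(k):
        if (a >> i) & 1:
            s += b
        if s & 1:
            s += n
        s >>= 1
    return s - n if s >= n else s


def mont_exp(x_m, e, n, k):
    """Raise a montgomerized base x_m to the power e > 0; result stays montgomerized."""
    bits = bin(e)[2:]                  # string starts at the highest set bit of e
    acc = x_m                          # accounts for that leading one
    for bit in bits[1:]:
        acc = mont_mul(acc, acc, n, k)         # square for every exponent bit
        if bit == "1":
            acc = mont_mul(acc, x_m, n, k)     # multiply when the bit is set
    return acc
```

As a usage example, a value `x` can be brought into the Montgomery Domain with `mont_mul(x, r2, n, k)`, where `r2` is 2^(2k) mod n, and the result of `mont_exp` can be brought back with `mont_mul(y_m, 1, n, k)`.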
4.5. ModAdd Operation
The ModAdd operation code instructs the core to perform a modular addition of the supplied elements in the given finite field. After preparing the core for the addition operation, the CLA adder will add the given operands using the appropriate arithmetic given by the finite field selection input. Once the last words of the given operands have been added the carry output bit of the CLA adder will be evaluated. If a carry bit is set, the modulus will be subtracted once; otherwise the result will be compared to the given modulus. If the result is equal to or greater than the modulus, it will also be subtracted once.
4.6. ModSub Operation
The ModSub operation code instructs the core to perform a modular subtraction of the supplied elements in prime fields. After preparing the core for the subtraction operation, the CLA adder will be used to perform a word-based subtraction by performing an addition in two’s complement representation with prime field arithmetic. After the last words of the given operands have been processed, the carry output bit of the CLA adder will be evaluated. If the carry bit signals a negative result, the modulus will be added once; otherwise the result will be compared to the given modulus. If the result is equal to or greater than the modulus, it will be subtracted once.
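The decision logic of ModAdd and ModSub can be modelled on k-bit integers as shown below. The function names, the parameter `k` for the precision bit length, and the `fsel` polarity (1 for GF(p)) are assumptions; the word-by-word processing of the CLA adder is collapsed into single integer operations.

```python
def mod_add(a, b, n, k, fsel=1):
    """Modular addition of reduced operands a and b."""
    if not fsel:
        return a ^ b                   # GF(2^m): carry-free addition
    mask = (1 << k) - 1
    s = a + b
    carry, s = s >> k, s & mask        # carry output bit and k-bit sum
    if carry or s >= n:
        s = (s - n) & mask             # subtract the modulus once
    return s


def mod_sub(a, b, n, k):
    """Modular subtraction in GF(p) via two's complement addition."""
    mask = (1 << k) - 1
    s = a + (b ^ mask) + 1             # a + (~b) + 1
    carry, s = s >> k, s & mask
    if not carry:                      # no carry out: the difference is negative
        s = (s + n) & mask             # add the modulus once
    elif s >= n:
        s -= n                         # otherwise compare and reduce if needed
    return s
```

For the modulus 241 with an 8-bit precision, `mod_add(200, 100, 241, 8)` exercises the carry path and `mod_sub(5, 10, 241, 8)` exercises the negative-result path.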
4.7. RAM Copy Operations
In order to support cryptographic algorithms which have been disassembled into a list of instructions, RAM copy operations are needed. According to the proposed RAM layout stated above four individual copy operations have been defined.
The CopyH2V operation code instructs the core to copy a number of words, according to the supplied precision parameter, from the horizontal RAM layout starting from the given source address to the vertical RAM layout starting from the given destination address.
The CopyV2V operation code instructs the core to copy a number of words, according to the supplied precision parameter, from the vertical RAM layout starting from the given source address to the vertical RAM layout starting from the given destination address.
The CopyH2H operation code instructs the core to copy a number of words, according to the supplied precision parameter, from the horizontal RAM layout starting from the given source address to the horizontal RAM layout starting from the given destination address.
The CopyV2H operation code instructs the core to copy a number of words, according to the supplied precision parameter, from the vertical RAM layout starting from the given source address to the horizontal RAM layout starting from the given destination address.
4.8. MontMult1 Operation
The MontMult1 operation code instructs the core to perform a single Montgomery Multiplication of the supplied element with the constant 1 in the given finite field. This type of operation is needed when a montgomerized value should be transformed back from the Montgomery Domain. It has been implemented as an independent operation since an operand holding the constant would otherwise unnecessarily occupy a vertical RAM slot. A Montgomery Multiplication with the constant 1 will be executed in a manner analogous to the MontMult operation, with the only exception that constant words will be used for the operand instead of the RAM words.
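The role of MontMult1 becomes clear in a round trip through the Montgomery Domain. The following Python sketch works in GF(p) on plain integers; the helper names `to_mont` and `from_mont` are hypothetical, and `r2` stands for the second Montgomery Parameter 2^(2k) mod n.

```python
def mont_mul(a, b, n, k):
    """Bit-serial Montgomery Multiplication in GF(p): a * b * 2^(-k) mod n."""
    s = 0
    for i in range(k):
        if (a >> i) & 1:
            s += b
        if s & 1:
            s += n
        s >>= 1
    return s - n if s >= n else s


def to_mont(x, r2, n, k):
    """Transform x into the Montgomery Domain: x * 2^k mod n."""
    return mont_mul(x, r2, n, k)


def from_mont(x_m, n, k):
    """Back-transformation: a Montgomery Multiplication with the constant 1."""
    return mont_mul(x_m, 1, n, k)
```

A product computed in the Montgomery Domain, `mont_mul(to_mont(a, ...), to_mont(b, ...), n, k)`, is itself montgomerized, so a single `from_mont` at the end of a chain of operations is sufficient.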
5. Exemplary Core Application Descriptions
This section gives exemplary descriptions of how the specified functional range of the proposed building-block Enhanced Montgomery Multiplication Core design can be utilized to support a wide range of cryptographic algorithms, demanding the least possible memory capacity while supporting as many precision widths as possible. Information is given on how to perform Chinese Remainder Theorem [22] (CRT) accelerated RSA private key operations and how to use the core in order to test/generate prime numbers.
For the support of elliptic curve cryptography over prime and binary finite fields, modular functions are given for preparing and conducting point operations for arbitrary elliptic curves for the supported precision widths. For all these algorithms a list of operations and the quantity of different operations is given, so that cryptographic algorithms can be performed by simply processing these operation lists.
5.1. CRT-Accelerated RSA Operation
In order to speed up RSA private key operations the CRT-accelerated version is also supported by the core. For this purpose some operations have to be performed with full precision, whereas most of the operations have to be performed with half precision. Algorithm 1 lists the necessary steps to utilize the core for CRT-accelerated RSA private key operations.

Table 1 illustrates the abstract operations lists of the core for the CRT-accelerated RSA application using the private key portion for all supported precision widths (512, 768, 1024, 1536, 2048, 3072, 4096). The number given in the index of the RAM locations denotes the offset given by the corresponding src_addr, dest_addr, src_addr_e, src_addr_x input signals. The width of the processed values depends on the supplied mwmac_precision input signal which depends on the operation. In the table operations requiring full precision (the precision of the RSA modulus) are marked by , operations requiring half precision are marked by . The mwmac_f_sel signal must be set to GF(p) arithmetic.

CRT-accelerated RSA private key operations require , , , , , , , and performed on half precision and , , , and performed on full precision.
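The split between half-precision and full-precision work can be seen in a generic CRT computation. The sketch below is not the operation list of the core: it uses Python's built-in `pow` in place of Montgomery Exponentiations, assumes Garner's recombination formula, and the function name `rsa_crt_private` and its parameter names are chosen for illustration.

```python
def rsa_crt_private(c, p, q, d):
    """RSA private key operation c^d mod (p * q) using the CRT."""
    dp = d % (p - 1)                   # reduced exponents, normally precomputed
    dq = d % (q - 1)
    q_inv = pow(q, -1, p)              # q^(-1) mod p, normally precomputed

    m1 = pow(c % p, dp, p)             # half precision: exponentiation mod p
    m2 = pow(c % q, dq, q)             # half precision: exponentiation mod q

    h = (q_inv * (m1 - m2)) % p        # half precision: Garner's correction term
    return m2 + h * q                  # full precision: recombination
```

The two exponentiations operate on moduli of half the RSA bit length, which is where the speed-up over a single full-precision exponentiation comes from. The three-argument `pow` with a negative exponent requires Python 3.8 or later.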
5.2. Prime Generation/Testing Operation
Algorithm 2 lists the necessary steps to utilize the core, in conjunction with a TRNG, as a Miller-Rabin Primality Tester. In the algorithm denotes the random integer to be tested for primality and denotes the confidence parameter determining the accuracy of the test, i.e., the number of Miller-Rabin loops. In a precomputation step the parameters and with must be calculated, which can be done by simple shift operations and counter increments in software. The test furthermore requires a number of random integers serving as random bases.

Table 2 illustrates the operations list for utilizing the core for Miller-Rabin Primality Test steps for all supported precision widths (192, 224, 256, 320, 384, 512, 768, 1024, 1536, 2048, 3072, 4096). The number given in the index of the RAM locations denotes the offset given by the corresponding src_addr, dest_addr, src_addr_e, src_addr_x input signals. The width of the processed values depends on the supplied mwmac_precision input signal. The mwmac_f_sel signal must be set to GF(p) arithmetic. Note that since the results of the performed operations will be in the Montgomery Domain, they will be checked against the Montgomery Parameter and instead of and . Also note that the random bases that will be checked need not be transformed into the Montgomery Domain first; they will simply be interpreted as random montgomerized values.

The total number of needed core operations depends on the security parameter and the value resulting from the factorization of . Within the outer for loop from writing a new to the RAM until the evaluation of , , , , , and and until evaluation of , and operations are required. Within the inner for loop until evaluation of updated , , , and and until evaluation of updated , and operations is required.
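The loop structure referred to above, an outer loop over the random bases and an inner loop of repeated squarings, is that of the standard Miller-Rabin test. A plain Python version is sketched below for orientation. It works on ordinary residues with the built-in `pow` instead of montgomerized values, takes the bases as an argument instead of drawing them from a TRNG, and uses the conventional names `s` and `d` for the factorization n - 1 = 2^s * d; all names are assumptions.

```python
def miller_rabin(n, bases):
    """Return False if odd n > 3 is proven composite, True if it passes all bases."""
    s, d = 0, n - 1
    while d % 2 == 0:                  # precomputation: n - 1 = 2^s * d with d odd
        d //= 2
        s += 1
    for a in bases:                    # outer loop: one round per base in [2, n - 2]
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue                   # this base does not witness compositeness
        for _ in range(s - 1):         # inner loop: repeated squaring
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False               # no square reached n - 1: composite
    return True
```

Each additional base that passes lowers the probability that a composite number is accepted, which is the role of the confidence parameter in the description above.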
5.3. Elliptic Curve Operations
Unlike modular exponentiation, which is based only on modular multiplications, elliptic curve Point Addition and Point Doubling operations involve modular additions, subtractions, and multiplications, even in the Jacobian projective coordinate representation [23]. The algorithms for prime field elliptic curve Point Addition and Point Doubling using Jacobian coordinates furthermore involve multiplications by some constants. Since the described core performs multiplication operations by using Montgomery Arithmetic, these constants must be transformed into the Montgomery Domain first for the intermediate values to remain montgomerized.
In order to utilize the core for elliptic curve operations the following modular functions have been specified for both GF(p) and GF(2^m) support:
(i) EC Preparation.
(ii) EC Montgomery Transformation.
(iii) EC Affine-to-Jacobi Transformation.
(iv) EC Point Validation.
(v) EC Point Doubling.
(vi) EC Point Addition.
(vii) EC Jacobi-to-Affine Transformation.
(viii) EC Montgomery Backtransformation.
In the following, algorithms for utilizing the core to perform EC operations in GF(p) are stated. For GF(2^m) EC support, similar algorithms have been derived.
5.3.1. GF(p) EC Preparation
The prime field EC Preparation steps include the calculation of the Montgomery Parameter , the exponent as well as the montgomerized versions of the constants and the EC Domain Parameters and for a given elliptic curve over GF(p). Algorithm 3 lists the necessary steps to utilize the core for EC prime field preparation.
