Abstract
This paper compares two Montgomery modular multiplication architectures, one systolic and one multiplexed, both targeting FPGA devices. Modular multiplication is the core of modular exponentiation, the most important operation in several public-key cryptographic algorithms, including the most popular of them, RSA. The proposed systolic architecture is a high-radix implementation with a one-dimensional array of Processing Elements. The multiplexed implementation is a new alternative composed of multiplier blocks working in parallel with new, simplified Processing Elements, and it provides a pipelined operation mode. We compare the time × area efficiency of both architectures as well as an RSA application. The systolic implementation runs the 1024-bit RSA decryption process in just 3.23 ms, and the multiplexed architecture executes the same operation in 4.36 ms, but the second approach saves up to 28% of logic resources. These results are competitive with state-of-the-art performance.
1. Introduction
Modular multiplication is widely employed in public-key cryptography, especially where modular exponentiation is essential. For instance, the most commonly used asymmetric cryptographic algorithm is RSA [1]. The security of RSA depends on the difficulty of factoring large numbers. Here, large numbers mean moduli of up to 4096 bits, each the product of two large primes, used in the cryptographic keys.
In this cryptosystem the main operation is the modular exponentiation using the public and private keys, the first to encrypt and the second to decrypt messages. So, the performance of the whole system depends on the efficiency of modular arithmetic implementations.
As modular operations are time consuming, it is common to use hardware devices to perform both the modular multiplication and the exponentiation. Among the hardware approaches, the increased use of reconfigurable devices to implement cryptographic operations, especially the FPGAs, is evident.
One of the most suitable methods for performing modular multiplications in hardware is the Montgomery multiplication [2]. This algorithm is fast and power efficient in hardware implementations. Instead of the trial division by the modulus that a conventional modular multiplication requires, the Montgomery multiplication uses right shifts. This method also allows the use of multiprecision arithmetic, which is useful for employing high-radix operations. High-radix operations in turn make it easier to develop modular multiplication architectures.
Aiming to implement RSA systems in hardware, many authors have proposed Montgomery multipliers for FPGAs [3–9]. Fully systolic architectures designed to speed up the modular multiplication have been presented. These architectures offer an array of Processing Elements (PEs) where each PE performs arithmetic additions and multiplications in a multiprecision context with carry propagation [10]. Depending on the word size (or radix) used, the architecture can employ a large number of Processing Elements, consequently increasing the demand for logic elements (area) in FPGA implementations.
As a new implementation alternative, the additions and multiplications can be multiplexed by inserting multiplier blocks in parallel with the Processing Elements. By forcing a pipelined operation mode and using a high-radix architecture (16 or 32 bits), the multiplexed multipliers retain the high-speed performance provided by systolic architectures, with fewer arithmetic and logic elements and minimal carry-signal propagation.
This paper presents a trade-off between two proposed modular multiplication architectures: a systolic implementation and a very high-radix multiplexed implementation. Our approach uses radix 2^{16} and radix 2^{32} in both implementations to speed up the processes and to match the resources of the Xilinx Virtex-4 and Virtex-5 FPGA series [11]. The proposed architectures show significant improvements compared to our previous work [12]. The systolic architecture uses simplified Processing Elements in order to reduce the utilization of FPGA resources. The multiplexed implementation is arranged in arithmetic cores, which allow us to scale the number of Processing Elements and multiplier blocks. Our goal is to show that the small increase in the number of clock cycles caused by the multiplexed multipliers is offset by a significant reduction in the use of logic and arithmetic resources.
This paper is organized as follows: Section 2 presents the Montgomery modular multiplication algorithm. Section 3 discusses related state-of-the-art works. The proposed architectures are presented in Section 4. Finally, the results and conclusion are presented in Sections 5 and 6, respectively.
2. Montgomery Modular Multiplication
The Montgomery Multiplication Algorithm is a method of performing modular multiplication without needing to divide by the modulus. In cryptography, the Montgomery Algorithm is very suitable for the hardware implementation of modular multiplication, because it allows long integers to be represented as sequences of words whose precision is given by a radix (generally a power of two).
The algorithm version used in this work is the original one, with some preconditions. Algorithm 1 shows the modular multiplication with the notation proposed in [13], which is used for the remainder of this text.

The value is the modular inverse of regarding the modulus, computed so that . The final result is placed on , after iterations, and is equal to , which must be corrected to retrieve the expected result (). The correction is done by performing an additional Montgomery multiplication with and as parameters. It is interesting to highlight that this correction is inexpensive during a modular exponentiation, because it only needs to be made one time after the whole exponentiation.
Since its publication in 1985 by Montgomery [2], the Montgomery Algorithm has undergone many modifications and improvements [14, 15]. One of those is particularly interesting, because it avoids the final subtraction simply by choosing the input data correctly. By limiting the operands and to integers less than and by defining as less than , the final is guaranteed to be less than [15]. These preconditions are shown in Algorithm 1 and applied to our architecture, as explained in Section 4.
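The word-serial procedure described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 1: the names `montgomery_multiply`, `a`, `b`, `m`, `k`, and `n` are hypothetical, the quotient constant is assumed to be the negated inverse of the modulus modulo the word base (the convention of [13]), and the preconditions are assumed to take their commonly stated form (operands below twice the modulus, and R = 2^(k·n) greater than four times the modulus).

```python
def montgomery_multiply(a, b, m, k, n):
    """Return a value congruent to a*b*R^-1 mod m, with R = 2**(k*n).

    Assumptions (not taken from the paper): m is odd, a and b are below 2*m,
    and R > 4*m, so the result stays below 2*m without a final subtraction.
    """
    mask = (1 << k) - 1
    # Quotient constant: -m^-1 mod 2^k (Python 3.8+ for the modular inverse).
    m_prime = (-pow(m, -1, 1 << k)) % (1 << k)
    s = 0
    for i in range(n):
        a_i = (a >> (k * i)) & mask  # i-th word of a
        # Quotient digit that makes the low word of the sum below zero.
        q = (((s & mask) + a_i * (b & mask)) * m_prime) & mask
        # Exact division by 2^k, implemented as a right shift.
        s = (s + a_i * b + q * m) >> k
    return s


# Illustrative parameters only: a 20-bit odd modulus, 4-bit words, 6 words.
m, k, n = 1000003, 4, 6
R = 1 << (k * n)
a, b = 123456, 654321
s = montgomery_multiply(a, b, m, k, n)
# Correction step: one more Montgomery multiplication by R^2 mod m.
corrected = montgomery_multiply(s, (R * R) % m, m, k, n) % m
assert corrected == (a * b) % m
```

In a modular exponentiation the correction in the last lines is needed only once, after the final multiplication, which is why its cost is negligible.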
3. Related Works
Tenca and Koç are widely referenced for their work on radix-2 Montgomery Algorithm implementations. These authors initially proposed architectures with improvements for the radix-2 Montgomery Algorithm, as in [16]. Even though the input operands are large numbers, radix-2 modular multiplications avoid the expensive multiplications that appear in high-radix implementations (radix 8 or more). Unlike the classic radix-2 Montgomery Algorithm [13], Tenca and Koç's modifications make the modular multiplication architecture scalable; that is, their proposed Montgomery multiplier is able to work with any precision of the input operands. In terms of hardware implementation, there is a systolic array architecture composed of Processing Elements and control blocks for managing the I/O words of the architecture. Each Processing Element contains only a few logic elements, providing a reduced area and a high clock frequency when synthesized for FPGA or ASIC.
Building on the above work, [4, 17] present improvements to the Tenca and Koç proposal. The advantage of these new approaches lies in the optimization of the Processing Elements and, consequently, in reducing the latency of the Montgomery modular multiplication by a factor of at least two; that is, the modular multiplication is at least twice as fast as in [16]. So, the main contributions are the improved modular multiplication speed and the reduced number of logic elements in the Processing Elements. In [4], a radix-4 scalable Montgomery modular multiplication architecture is proposed to enhance the speed. Despite improvements in speed, these radix-2 and radix-4 architectures are still limited by the large number of clock cycles required.
Furthermore, in the context of high-radix implementations, a systolic architecture is presented in [3] which is composed of Processing Elements able to provide modular multiplication for a radix greater than 4. Despite its time and area efficiency, this architecture requires preprocessing before the modular multiplication execution. The authors make use of the optimized Montgomery algorithm initially proposed in [14], which simplifies the quotient computation by reducing quotient determination to a simple truncation operation. However, as a consequence, the input operands must meet certain range limitations, and the optimized Montgomery Algorithm needs three additional iterations, because the input operand is left shifted and has to be corrected with these further iterations.
To avoid preprocessing in a high-radix modular multiplication, [5] presents a fully systolic array architecture composed of Processing Elements containing internal multipliers and adders. The Montgomery algorithm version used in this implementation is also the optimized version proposed in [14]. As a radix 2^{16} implementation, the modular multiplication takes only 103 clock cycles, significantly fewer than other architectures [3, 16, 17].
4. The Proposed Architectures
The proposed architectures for performing Montgomery modular multiplication are detailed in this section. First, the systolic architecture is described in detail as well as the Processing Elements behaviour. Second, the multiplexed and systolic Montgomery modular multiplication architecture is presented.
4.1. The Systolic Architecture
The concept of systolic architecture combines a highly parallel array of identical Processing Elements or datapaths with local connections, which take external inputs and process them in a predetermined manner and in a pipelined fashion.
The proposed systolic architecture is directly based on the arithmetic operations of the Montgomery Algorithm, which are performed in a power-of-two numerical base, in which the large input operands are processed in a multiprecision context as words whose bit width is set by that base. As seen in Section 2, the Montgomery Algorithm has additions and multiplications involving large integers that make use of multiple-precision arithmetic.
The architecture is composed of Processing Elements distributed in a one-dimensional array, where each Processing Element is responsible for the computation involving the words of the input operands with the same index as the Processing Element. For example, for a 1024-bit modular multiplication with radix 2^{32}, the operands are split into 32 words of 32 bits, which results in a one-dimensional array of 32 Processing Elements.
Between the Processing Elements, there is a propagation of carry signals which are the most significant bits of the arithmetic processes in each PE. The carry signals are processed as input parameters by the Processing Elements that receive them.
In the systolic architecture, the Processing Elements are designed as finite state machines. The control block communicates with the first Processing Element (PE1) and with the block responsible for the quotient calculation, according to line 4 of the Montgomery Algorithm. Figure 1 presents the systolic architecture.
The finite state machine structure of the control block is designed to provide the required words for a modular multiplication to the Processing Elements and to the quotient block. Thus, at each Montgomery Algorithm iteration, these words are read from an external RAM memory and passed to the remaining architecture. At the end of the modular multiplication, the control block provides the Montgomery multiplication result through an output multiplexer.
The one-dimensional array of Processing Elements performs the main accumulation step of the Montgomery Algorithm. This operation comprises two multiplications, each between an input operand and a single word, followed by the addition of the two products. Therefore, the systolic architecture works in a multiprecision context, and each Processing Element is responsible for performing the arithmetic operations involving one word of each input operand. Thus, the number of words of each operand is equal to the number of Processing Elements. Figure 2 shows the flowchart of the arithmetic operations within each Processing Element.
According to Figure 2, each multiplication between two words returns a double-width result. The least significant half of one product is added to the least significant half of the other product. Finally, the least significant word of this sum is added to a word of the result of the previous iteration. The carry signals propagated to the next Processing Element are the most significant halves of the two products and the most significant bits of the last addition.
4.1.1. First Processing Element
The first Processing Element (PE1) establishes communication with the control block and receives and words at each Montgomery Algorithm iteration. This PE differs from the other Processing Elements because it does not receive any carry signal as input and it discards the first word of the result, which means the division of by . The zero index words of and ( and ) are also provided to this first Processing Element. The internal architecture of PE1 is shown in Figure 3.
4.1.2. General Processing Element
The other Processing Elements differ from PE1 because they output a word of the result and they also transmit and receive the carry signals of the multiprecision multiplications and additions. Each Processing Element is activated by the previous Processing Element when the latter finishes its calculation and sends out its carry signals, which means that the architecture works in a pipelined fashion. Only the last Processing Element provides two words of the result at each iteration of Algorithm 1, because the final word is obtained from a sum of carry signals. To avoid instantiating a new Processing Element just to perform this sum, it is calculated in the last Processing Element. Figure 4 presents the internal architecture of the general Processing Elements.
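The behaviour of the Processing Element array during one iteration can be modelled in a few lines of Python. This is a functional sketch only, under several assumptions: the names `split_words`, `join_words`, `pe_array_iteration`, and all parameters are hypothetical, the quotient word `q` is supplied from outside, and the separate carry signals of the hardware are merged into a single integer carry for brevity.

```python
def split_words(x, k, n):
    """Split integer x into n words of k bits, least significant first."""
    mask = (1 << k) - 1
    return [(x >> (k * j)) & mask for j in range(n)]


def join_words(words, k):
    """Inverse of split_words."""
    return sum(w << (k * j) for j, w in enumerate(words))


def pe_array_iteration(s_words, a_i, b_words, m_words, q, k):
    """One iteration S <- (S + a_i*B + q*M) / 2^k, computed one word per PE."""
    mask = (1 << k) - 1
    out = []
    carry = 0  # the first PE receives no carry
    for j in range(len(b_words)):
        t = s_words[j] + a_i * b_words[j] + q * m_words[j] + carry
        if j > 0:
            out.append(t & mask)  # general PE: emits one result word
        # For j == 0 the low word is discarded (it is zero when q is
        # chosen correctly), which realizes the division by 2^k.
        carry = t >> k
    out.append(carry)  # last PE: second result word from the final carry
    return out


# Illustrative values only: 8-bit words, 4 words per operand.
k, n = 8, 4
M, B, S, a_i = 0x2F0D3B51, 0x4A11C0DE, 0x12345678, 0xC7
m_prime = (-pow(M, -1, 1 << k)) % (1 << k)  # assumed quotient constant
q = (((S & 0xFF) + a_i * (B & 0xFF)) * m_prime) & 0xFF
out = pe_array_iteration(split_words(S, k, n), a_i,
                         split_words(B, k, n), split_words(M, k, n), q, k)
assert join_words(out, k) == (S + a_i * B + q * M) >> k
```

The loop body corresponds to one Processing Element; in hardware the elements start one after another as the carries become available, which gives the pipelined behaviour described above.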
4.1.3. Quotient Block
At each iteration of Algorithm 1, line 3 presents the quotient computation so that becomes a multiple of . The internal architecture of the quotient block is shown in Figure 5. This structure has a combinational behaviour where the result is obtained in one clock cycle. , , , and are bits words which are provided for this block at each iteration of Algorithm 1.
The zero index of and means that these words contain the least significant bits (LSBs) of and operands, respectively. As we can see in the right side of Figure 5, a multiplication between and will provide a bits result. Just the LSB part of this result is used in the next operation. Another input of the quotient block, , is then added to the LSB part obtained from the first multiplication. Again, we only need the LSB part of this addition, which is finally multiplied by , which corresponds to the modular inverse of modulo . The LSB part of this last multiplication is the desired result. As seen in Algorithm 1, the numerical basis is power of 2, so for hardware architecture, the operation is simply performed by a right shift operation (LSB selection).
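The sequence of operations inside the quotient block (multiply, keep the low word, add, keep the low word, multiply by the constant, keep the low word) can be written as a short Python sketch. The names `quotient_block`, `a_i`, `b_0`, `s_0`, `m_prime`, and `k` are hypothetical, and the constant is assumed to be the negated inverse of the modulus modulo 2^k, as in [13].

```python
def quotient_block(a_i, b_0, s_0, m_prime, k):
    """Quotient word of one iteration, using only k-bit low parts."""
    mask = (1 << k) - 1
    low_product = (a_i * b_0) & mask   # low half of the first multiplication
    low_sum = (low_product + s_0) & mask   # low part of the addition
    return (low_sum * m_prime) & mask  # low half of the second multiplication


# Illustrative values only: 16-bit words and an arbitrary odd modulus word.
k = 16
m_0 = 0xB4C7                                  # least significant modulus word
m_prime = (-pow(m_0, -1, 1 << k)) % (1 << k)  # assumed: -M^-1 mod 2^k
a_i, b_0, s_0 = 0x1234, 0xBEEF, 0x0F0F
q = quotient_block(a_i, b_0, s_0, m_prime, k)
# With this q, the low word of s_0 + a_i*b_0 + q*m_0 is zero.
assert (s_0 + a_i * b_0 + q * m_0) % (1 << k) == 0
```

Selecting the low word after each step is what keeps the block down to two single-precision multiplications and one single-precision addition.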
So, the complexity of the quotient block relies on two single precision multiplications and one single precision addition. To evaluate the number of clock cycles for a modular multiplication, we have to consider the first cycles to read the and operands from RAM memories for a square or modular multiplication, respectively. The first iteration of Algorithm 1 also needs clock cycles. The remaining iterations of Algorithm 1 are performed in clock cycles.
4.2. The Multiplexed Systolic Architecture
As seen in the previous section, the systolic architecture presents a one-dimensional array of Processing Elements, and each PE is responsible for addition and multiplication operations. When the numerical base is high (2^{16}, 2^{32}), the internal multiplications become more complex, mainly if the design is applied to an FPGA or an ASIC. So, as the number of multipliers increases, the physical limitations grow proportionally, for example, in maximum clock frequency and area.
Based on these constraints, a multiplexed and systolic architecture with multiplier blocks working in parallel with the Processing Elements is presented in this section. It moves the word multipliers from the Processing Elements to the multiplier blocks. Each multiplier block, together with four Processing Elements, forms an arithmetic core. The one-dimensional arrangement of these arithmetic cores forms the structure of the modular multiplication architecture. Figures 6 and 7 show the multiplexed and systolic architecture and the arithmetic core structure, respectively.
The multiplexed architecture is composed of exactly one arithmetic core for every four operand words, and the first core is managed by a control block designed as a finite state machine. According to Figure 7, each arithmetic core contains four Processing Elements, a multiplier, and a RAM memory. Being a multiprecision arithmetic architecture, the number of Processing Elements is equivalent to the number of words in each input operand. So, the RAM memory placed in each arithmetic core stores four words of each operand it holds.
The multiplier block performs the two word multiplications of each step. The least significant bits of one product are added to a word of the previous result. The least significant bits of this sum and the least significant bits of the other product are sent to the Processing Elements to be added. The Processing Elements provide the words of the current iteration result. Figure 8 illustrates the operations performed by Arithmetic Core 1. This illustration shows that, instead of having two single-precision multipliers in each Processing Element, there is one multiplier block that performs all single-precision multiplications for a total of four Processing Elements. In other words, the number of single-precision multipliers is reduced by a factor of four. With these improvements, each Processing Element needs to perform just one addition.
The calculation of the quotient is performed by a block with architecture that is identical to that of the quotient block presented in the systolic architecture.
The Montgomery Algorithm's multiplications are made by a multiplier block that utilizes the multipliers available in the FPGA. The internal architecture of the multiplier blocks is shown in Figure 9.
The carry signals propagated inside the multiplexed architecture are the most significant bits of the and operations presented in Algorithm 1 and are propagated between the multiplier blocks. The last multiplier block sends its carry signals to the fourth and final Processing Element placed in the last arithmetic core. The other carry signal, , is the most significant bits of the result of the addition between the and terms. This last addition is performed by the Processing Elements.
At the end of the iteration, the result is sent out through an output multiplexer to the memory that is part of the modular exponentiation architecture (described in the next section).
In terms of clock cycles for the Montgomery modular multiplication, we can define the following: initially, clock cycles are reserved for operand internal storage. This operand is read from RAM memories. Considering that the modulus is already available on internal RAM memories placed in arithmetic cores, the first iteration also takes clock cycles and, it takes the architecture clock cycles to perform the remaining iterations of Montgomery Algorithm. Thus, the total number of clock cycles, for a modular (or squared) multiplication is .
4.2.1. The Processing Elements (PEs)
The proposed modular multiplication architecture is composed of as many Processing Elements as there are words in the operands, which is also the number of iterations of Algorithm 1. Due to the placement of a multiplier block in each arithmetic core, each Processing Element needs to perform just one addition between two words and sends out a word of the result at each iteration of Algorithm 1. The first Processing Element must discard the least significant bits of its first addition in order to perform the right shift operation, which corresponds to the division of the intermediate result by the numerical base.
The remaining Processing Elements perform the addition between and terms and the resultant least significant bits word of this addition are sent out as a word of the result. The most significant bits are sent to the next Processing Element as a carry signal. The last Processing Element () is responsible for providing two words of result ( and ), considering that the input words for calculus are the carry signals from the last multiplier block. Figure 10 shows the first, general case and the last Processing Elements.
4.3. Modular Exponentiation
For a real cryptographic application concerning the RSA algorithm, a modular exponentiation structure that incorporates the modular multiplication architecture is proposed in this section. The modular exponentiation algorithm used in this work is left-to-right square and multiply [13]; thus, on average, 1.5n modular multiplications (counting both squarings and multiplications) are performed to achieve the final exponentiation result, where n is the operand's precision in bits. Algorithm 2 shows the Montgomery modular exponentiation algorithm.

Four Block RAM memories generated with the Xilinx Coregen tool are used to store the input operands. These input operands are the modulus, the exponent, the message in the Montgomery domain, and an auxiliary term. A control block with a finite state machine manages the read and write operations on the memories (see Figure 11).
The results of the successive modular multiplications are stored in the RAM memory that previously stored the operand, because this operand is needed only in the first square execution.
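The left-to-right square and multiply procedure in the Montgomery domain can be sketched as follows. This is an illustrative model and not a transcription of Algorithm 2: the names `montgomery_modexp`, `mont`, and `r_bits` are hypothetical, the Montgomery multiplier is replaced by a behavioural big-integer model, and the conversion of the message into the Montgomery domain is done inside the function, whereas the text above assumes it is already stored in memory.

```python
def montgomery_modexp(msg, exp, m, r_bits):
    """Compute msg**exp mod m with left-to-right square and multiply.

    Assumptions: m is odd, 0 <= msg < m, exp >= 1, and R = 2**r_bits > m.
    """
    R = 1 << r_bits
    r_inv = pow(R, -1, m)  # Python 3.8+

    def mont(x, y):
        # Behavioural model of one Montgomery multiplication: x*y*R^-1 mod m.
        return (x * y * r_inv) % m

    msg_bar = mont(msg, (R * R) % m)  # message in the Montgomery domain
    acc = R % m                       # the value 1 in the Montgomery domain
    for bit in bin(exp)[2:]:          # scan exponent bits, MSB first
        acc = mont(acc, acc)          # square at every bit
        if bit == "1":
            acc = mont(acc, msg_bar)  # multiply only when the bit is set
    return mont(acc, 1)               # leave the Montgomery domain


# Illustrative values only (far smaller than a real RSA modulus).
assert montgomery_modexp(65, 17, 3233, 16) == pow(65, 17, 3233)
```

Every exponent bit costs one squaring, and on average half of the bits cost an additional multiplication, which gives the average of 1.5 multiplications per exponent bit.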
5. Results
Table 1 summarizes the FPGA synthesis results of the two proposed modular multiplication architectures. The designs were described in hardware description languages (VHDL and Verilog) and synthesized for Xilinx Virtex-4 and Virtex-5 FPGAs. All results are post-implementation, and no area or speed optimizations were set for the synthesis. The results presented in this paper are improvements over our previous work [12]. The multiplexed architecture is implemented with a reduced number of slice registers and DSP48s. However, the systolic architecture achieved higher clock frequencies in synthesis.
Table 2 presents RSA encryption and decryption applications of the proposed architectures. Since the modular exponentiation is performed by successive modular multiplications, the left-to-right (MSB-first) binary square and multiply algorithm was employed in the modular exponentiation. The results show that, considering the number of clock cycles for a modular multiplication, the multiplexed architecture is faster than the systolic implementation. On the other hand, the systolic architecture has a higher clock frequency than the multiplexed architecture.
Table 3 shows a comparison of our results with the state of the art. Every work referenced in this table used the Montgomery Algorithm for its hardware modular multiplication architecture, and for a direct comparison with our approaches only 1024-bit applications are shown. The modular multiplication times, when not stated in the references, are estimated by considering a modular exponentiation of n bits through the Square and Multiply algorithm, running 1.5n modular multiplications.
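The estimate described above amounts to dividing a reported exponentiation time by the average number of modular multiplications. A minimal sketch follows; the function name `estimate_multiplication_time` is hypothetical, and the input time in the example is an arbitrary placeholder, not a figure from any of the cited works.

```python
import math


def estimate_multiplication_time(exponentiation_time_s, n_bits):
    """Average time of one modular multiplication, assuming 1.5*n_bits of them."""
    return exponentiation_time_s / (1.5 * n_bits)


# Placeholder input: a 1024-bit exponentiation assumed to take 1.536 ms.
t_mult = estimate_multiplication_time(1.536e-3, 1024)
assert math.isclose(t_mult, 1.0e-6)  # 1.536 ms / 1536 multiplications
```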
6. Conclusion
This paper presented two Montgomery modular multiplication architectures and the results of their synthesis for Xilinx Virtex-4 and Virtex-5 FPGAs. A systolic implementation and a multiplexed implementation, suitable for the RSA public-key cryptosystem, were developed, and the designs were carefully matched to the features of the FPGAs, utilizing embedded DSP48E slices and Block RAM. The designs are improvements on a previous work. The multiplexed implementation presented good performance considering time × area efficiency. The systolic architecture can run the 1024-bit RSA decryption process in 3.23 ms, and the multiplexed implementation executes the same operation in 4.36 ms. Because of the multiplexed approach, the architecture is scalable. If the key size increases, the architecture can be easily modified by adding arithmetic cores while keeping the performance. A further speed improvement can be achieved by using a parallel modular exponentiation algorithm, for example, the Montgomery Powering Ladder [18], where a full modular exponentiation would be performed in a fixed number of clock cycles, that is, 33% faster than the square and multiply algorithm.
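The Montgomery Powering Ladder mentioned above can be sketched as follows. This is a generic illustration of the ladder and not part of the presented designs: the name `montgomery_ladder` is hypothetical, and ordinary modular multiplication stands in for the hardware Montgomery multiplier.

```python
def montgomery_ladder(base, exp, m):
    """Compute base**exp mod m with the Montgomery Powering Ladder (exp >= 0)."""
    r0, r1 = 1 % m, base % m
    for bit in bin(exp)[2:]:  # scan exponent bits, MSB first
        if bit == "1":
            # The two products below do not depend on each other,
            # so two multipliers could compute them at the same time.
            r0, r1 = (r0 * r1) % m, (r1 * r1) % m
        else:
            r0, r1 = (r0 * r0) % m, (r0 * r1) % m
    return r0


# Illustrative values only.
assert montgomery_ladder(65, 17, 3233) == pow(65, 17, 3233)
```

Each exponent bit always costs one squaring and one multiplication, which are independent of each other. With two multipliers working in parallel, an n-bit exponentiation takes n multiplication times instead of the 1.5n average of square and multiply, which is the source of the 33% reduction.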
Acknowledgment
This paper is a result of the project “INOVALABLaboratories Technological Innovation in Electronic and Microelectronic”. We acknowledge the financial support received from FINEP.