Research Article  Open Access
Jiankuo Dong, Fangyu Zheng, Wuqiong Pan, Jingqiang Lin, Jiwu Jing, Yuan Zhao, "Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration", Security and Communication Networks, vol. 2017, Article ID 3508786, 15 pages, 2017. https://doi.org/10.1155/2017/3508786
Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration
Abstract
Asymmetric cryptographic algorithm (e.g., RSA and Elliptic Curve Cryptography) implementations on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is to exploit the highly parallel GPU architecture and port the integer-based algorithms from general-purpose CPUs to GPUs, to offer high performance. However, the great potential cryptographic computing power of GPUs, especially that of the more powerful floating-point instructions, has in fact not been comprehensively investigated. In this paper, we fully exploit the floating-point computing power of GPUs through various designs, including a floating-point-based Montgomery multiplication/exponentiation algorithm and a Chinese Remainder Theorem (CRT) implementation in GPU. For practical use of the proposed algorithm, a new method is introduced to convert the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further improving the overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which achieves 13 times the performance of the previous fastest floating-point-based implementation (published in Eurocrypt 2009). The RSA-4096 decryption exceeds the existing fastest integer-based result by 23%.
1. Introduction
With the rapid development of e-commerce and cloud computing, high-density calculation of digital signatures and asymmetric cryptographic algorithms such as Elliptic Curve Cryptography (ECC) [1, 2] and RSA [3] is urgently required. However, without significant advances in recent years, CPUs find it more and more difficult to keep pace with such rapidly increasing demands. Specialized for the compute-intensive and highly parallel computations required by graphics rendering, GPUs possess much more powerful computation capability than CPUs by devoting more transistors to arithmetic processing units rather than data caching and flow control. With the advent of NVIDIA Compute Unified Device Architecture (CUDA) technology, it is now possible to perform general-purpose computation on GPUs. Due to their powerful arithmetic computing capability and moderate cost, many researchers resort to GPUs for cryptographic acceleration.
Born for high-definition 3D graphics, GPUs require high-speed floating-point processing capability; thus the floating-point units residing in GPUs have increased dramatically. From 2010 to the present, the floating-point computing power of CUDA GPUs has grown almost 10 times, from 1,345/665.6 (Fermi architecture) Giga Floating-Point Operations Per Second (GFLOPS) to 10,609/5,304 (Pascal architecture) GFLOPS for single-/double-precision floating-point arithmetic. By contrast, the integer multiplication arithmetic of NVIDIA GPUs gained only a 25% performance improvement from the Fermi to the Kepler architecture; the latest Maxwell and Pascal architectures do not even provide dedicated device instructions for 32-bit integer multiply and multiply-add arithmetic [4].
However, the floating-point instruction set is inconvenient for realizing large integer modular multiplication, which is the core operation of asymmetric cryptography. More importantly, the floating-point instruction set in previous GPUs showed no significant performance advantage over the integer one. To the best of our knowledge, Bernstein et al. [5] are the first and only ones to utilize the floating-point processing power of CUDA GPUs for asymmetric cryptography. However, compared with their later work [6] based on the integer instruction set, the floating-point-based one achieves only about 1/6 of the performance. Nevertheless, with the rapid development of GPU floating-point processing power, fully utilizing the floating-point processing resources will bring great benefits to asymmetric cryptography implementations in GPUs.
Based on the above observations, in this paper, we propose a new approach to implement high-performance RSA by fully exploiting the double-precision floating-point (DPF) processing power in CUDA GPUs. In particular, we propose a DPF-based large integer representation and adapt the Montgomery multiplication algorithm to it. Also, we flexibly employ the integer instruction set to supplement the deficiency of the DPF computing power. Besides, we fully utilize the latest shuffle instruction to share data between threads, which makes the core Montgomery multiplication a design free of memory accesses, greatly decreasing the performance loss in thread communication. Additionally, a method is implemented to apply the proposed DPF-based RSA decryption algorithm in practical scenarios where the input and output shall be in bit format, further improving the overall performance by about 5% by decreasing data transfer cost with a smaller data format and leveraging GPUs to promote the efficiency of data conversion.
With these improvements, on GTX TITAN, the performance of the proposed RSA implementation increases dramatically compared with previous works. In particular, the experimental results of RSA-2048/3072/4096 decryption with CRT reach throughputs of 42,211/12,151/5,790 operations per second, achieving 13 times the performance of the previous floating-point-based implementation [5], and RSA-4096 decryption reaches 1.23 times the performance of the existing fastest integer-based one [7].
The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 presents an overview of GPU, CUDA, floating-point elementary knowledge, RSA, and Montgomery multiplication. Section 4 describes the proposed floating-point-based Montgomery multiplication algorithm in detail. Section 5 shows how to implement RSA decryption in GPUs using the proposed Montgomery multiplication. Section 6 analyses the performance of the proposed algorithm and compares it with previous works. Section 7 concludes the paper.
2. Related Work
Many previous papers demonstrate that the GPU architecture can be used as an asymmetric cryptography workhorse. For ECC, Antão et al. [8, 9] and Pu and Liu [10] employed the Residue Number System (RNS) to parallelize the modular multiplication into several threads. Bernstein et al. [6] and Leboeuf et al. [11] used one thread to handle a modular multiplication with Montgomery multiplication. Pan et al. [12], Zheng et al. [13], Bos [14], and Szerwinski and Güneysu [15] used fast reduction to implement modular multiplication over the Mersenne prime fields [16].
Unlike ECC schemes, RSA calculation requires a longer and non-fixed modulus and depends on modular exponentiation. Before NVIDIA CUDA appeared on the market, Moss et al. [17] mapped RNS arithmetic to the GPU to implement a 1024-bit modular exponentiation. Later, on CUDA GPUs, Szerwinski and Güneysu [15] and Harrison and Waldron [18] developed efficient modular exponentiation with both the Montgomery multiplication Coarsely Integrated Operand Scanning (CIOS) method and RNS arithmetic. Jang et al. [19] presented high-performance SSL acceleration using CUDA GPUs. They parallelized the Separated Operand Scanning (SOS) method [20] of Montgomery multiplication by single limbs. Jeffrey and Robinson [21] used similar technology to implement 256-bit, 1024-bit, and 2048-bit Montgomery multiplication on GTX TITAN. Neves and Araujo [22] and Henry and Goldberg [23] used one thread to perform a single Montgomery multiplication to economize the overhead of thread synchronization and communication; however, their implementations resulted in a very high latency. Emmart and Weems [7] applied one thread to a row-oriented multiplication for 1024-bit modular exponentiation and a distributed model based on the CIOS method for 2048-bit modular exponentiation. Yang [24] used an Integrated Operand Scanning (IOS) method with a single limb or two limbs for the Montgomery multiplication algorithm to implement RSA-1024/2048 decryption.
Note that all the above works are based on the CUDA integer computing power. Bernstein et al. are the pioneers in utilizing CUDA floating-point processing power in asymmetric cryptography implementations [5]. They used 28 single-precision floating-point numbers (SPFs) to represent a 280-bit integer and implemented the field arithmetic. However, the result was barely satisfactory. Their later work [6] on the same platform, GTX 295, using the integer instruction set performed almost 6.5 times the throughput of [5].
3. Background
This section provides a brief introduction to the basic architecture of modern GPUs, floating-point arithmetic, the basic knowledge of RSA, and Montgomery multiplication.
3.1. GPU and CUDA
CUDA is a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the GPU. It is created by NVIDIA and implemented by the GPUs which they produce [4].
The target platform, GTX TITAN, is a CUDA-compatible GPU (codename GK110) with CUDA Compute Capability 3.5 [25], which contains 14 streaming multiprocessors (SMs). 32 threads (grouped as a warp) within one SM run concurrently in a clock cycle. Following the Single Instruction Multiple Threads (SIMT) architecture, each GPU thread runs one instance of the kernel function. A warp may be preempted when it is stalled due to memory access delay, and the scheduler may switch the runtime context to another available warp. Multiple warps of threads, called a block, are usually assigned to one SM for better utilization of its pipeline. Each SM has 64 KB of on-chip memory which can be configured as fast shared memory and L1 cache; the maximum allocation of shared memory is 48 KB [25]. Each SM also possesses 64 K 32-bit registers. All SMs share 6 GB of 256-bit-wide slow global memory, cached read-only texture memory, and cached read-only constant memory [4, 25]. Each SM of GK110 owns 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFUs), and 32 load/store units [25], yielding a throughput of 192 SPF arithmetic, 64 DPF arithmetic, 160 32-bit integer add, and 32 32-bit integer multiplication instructions per clock cycle [4, 25].
NVIDIA GPUs of Compute Capability 3.x or later bring a new method of data sharing between threads. Previously, sharing data between threads required separate store and load operations to pass data through shared memory. First introduced in the NVIDIA Kepler architecture, the shuffle instruction [4, 25] allows threads within a warp to share data directly. With the shuffle instruction, threads within a warp can read values of other threads in any imaginable permutation [25]. NVIDIA conducted various experiments [26] comparing the shuffle instruction with shared memory, which show that the shuffle instruction always gives better performance than shared memory. This instruction is available on GTX TITAN.
3.2. Integer and FloatingPoint Arithmetic in GPU
NVIDIA GPUs with CUDA Compute Capability 3.x support both integer and floating-point arithmetic instruction sets. This section introduces the instructions of most concern in asymmetric cryptographic algorithm implementations, including the add and multiply(-add) operations.
Integer. The integer arithmetic instructions add.cc, addc, mad.cc, and madc are provided to perform multi-precision add and multiply-add operations, which reference an implicitly specified condition code register (CC) having a single carry flag bit (CC.CF) holding the carry-in/carry-out [27]. With this native support for multi-precision arithmetic, most of the previous works [6, 8–11, 13–15, 18, 19, 21–23] chose integer instructions to implement the large integer arithmetic in asymmetric cryptographic algorithms.
One noteworthy point is that the multiply and multiply-add instructions of NVIDIA GPUs with Compute Capability 3.x have a unique feature: when calculating a 32-bit multiplication, the whole 64-bit product cannot be obtained using a single instruction but requires two independent instructions (one for the lower 32-bit half and the other for the upper 32-bit half). Although a whole-multiplication "instruction" (mul.wide) is provided, which is used in [22, 23], it is a virtual instruction rather than a native one and is broken into 2 native instructions (mul.lo and mul.hi) when running on the GPUs.
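As an illustration of this split (a plain-Python sketch, not CUDA code; the function names `mul_lo`/`mul_hi` are ours, mirroring the PTX instruction names), the two halves of a full 32-bit × 32-bit product can be modeled as:

```python
def mul_lo(a: int, b: int) -> int:
    """Lower 32 bits of the 64-bit product (what mul.lo returns)."""
    return (a * b) & 0xFFFFFFFF

def mul_hi(a: int, b: int) -> int:
    """Upper 32 bits of the 64-bit product (what mul.hi returns)."""
    return (a * b) >> 32

a, b = 0xDEADBEEF, 0x12345678
# The virtual mul.wide result is just the two native halves recombined.
assert a * b == (mul_hi(a, b) << 32) | mul_lo(a, b)
```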
Floating-Point. The floating-point arithmetic instructions in CUDA GPUs comply with the IEEE 754-2008 Standard for Floating-Point Arithmetic [28]. Among the five basic formats which the standard defines, 32-bit binary (i.e., SPF) and 64-bit binary (i.e., DPF) are supported in NVIDIA GPUs. As demonstrated in Table 1, the real value assumed by a given SPF or DPF datum with a sign bit s, a biased exponent e, and a significand fraction m is (−1)^s × 2^(e−bias) × 1.m. Therefore, each SPF or DPF can represent a 24-bit or 53-bit integer, respectively.

Add and multiply(-add) operation instructions for floating-point are natively supported. In particular, the floating-point multiply-add operation is always implemented by the fused multiply-add (fma) instruction, which is executed in one instruction with a single rounding.
Unlike the integer instructions, the floating-point add and multiply-add instructions do not support a carry flag (CF) bit. When the result of the add instruction is beyond the limit of significand bits (24 for SPF, 53 for DPF), a round-off operation happens, in which the least significant bits are left out to keep the significand within the limit. This round-off causes precision loss, which is intolerable in cryptographic calculation. Thus, in the algorithm design, all operations should be carefully scheduled to avoid it.
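The 53-bit round-off behaviour can be observed directly in Python, whose floats are IEEE 754 doubles (a sketch for illustration only; on a GPU the same arithmetic runs through the DPF units):

```python
# 2^53 is the point where consecutive integers stop being representable.
exact = 2.0**53
assert exact + 1.0 == exact          # 2^53 + 1 is rounded back: precision loss
assert (exact - 1.0) + 1.0 == exact  # within 53 bits, arithmetic stays exact

# Products of two 26-bit operands occupy at most 52 bits, so they still
# fit in the 53-bit significand: no rounding occurs.
a = b = (1 << 26) - 1
assert a * b == int(float(a) * float(b))
```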
Table 2 compares the computing power of integer, SPF, and DPF on the target platform GTX TITAN. It is found that integer is more advantageous than floating-point in the add operation, but the multiply(-add) operation of DPF offers 2.6 times the performance of integer.
 
Notes to Table 2: the addends should be lower than 24 (SPF) or 53 (DPF) bits to avoid the round-off problem; the multiplicand and multiplier should be lower than 12 (SPF) or 26 (DPF) bits to avoid the round-off problem. Since the product of two 32-bit integers requires 2 multiplication instructions, "Bits/SM/CLOCK" for "Integer" is halved accordingly.
3.3. RSA and Montgomery Multiplication
RSA [3] is an algorithm widely used for digital signature and asymmetric encryption, whose core operation is modular exponentiation. In practical scenarios, CRT [29] is widely used to speed up RSA decryption. Instead of calculating an n-bit modular exponentiation directly, two (n/2)-bit modular exponentiations ((1a) and (1b)) and the Mixed-Radix Conversion (MRC) algorithm [30] (2) are sequentially performed to conduct the RSA decryption:

m_p = c^(d_p) mod p, (1a)
m_q = c^(d_q) mod q, (1b)
m = m_q + q × ((q^(−1) × (m_p − m_q)) mod p), (2)

where p and q are (n/2)-bit prime numbers chosen in private key generation (N = p × q). All parameters, p, q, d_p, d_q, and q^(−1) (mod p), are parts of the RSA private key [31]. Compared with calculating n-bit modular exponentiations directly, the CRT technique gives about 3 times the performance [29].
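The CRT decryption path can be sketched end to end with toy parameters (a minimal Python illustration; real keys use 1024-bit or larger primes, and the values below are purely illustrative):

```python
# Toy RSA key (the classic p = 61, q = 53 textbook example).
p, q = 61, 53
n, e = p * q, 17
d = pow(e, -1, (p - 1) * (q - 1))     # private exponent
dp, dq = d % (p - 1), d % (q - 1)     # CRT exponents d_p, d_q
qinv = pow(q, -1, p)                  # q^{-1} mod p, part of the private key

def decrypt_crt(c: int) -> int:
    mp = pow(c, dp, p)                # (1a): half-size exponentiation mod p
    mq = pow(c, dq, q)                # (1b): half-size exponentiation mod q
    h = (qinv * (mp - mq)) % p        # (2): mixed-radix recombination
    return mq + h * q

m = 42
assert decrypt_crt(pow(m, e, n)) == m
```

The two half-size exponentiations are what the GPU kernel parallelizes; the recombination step (2) is the part Section 5.2 later moves onto the GPU as well.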
Even with the CRT technique, the bottleneck restricting the overall performance of RSA lies in modular multiplication. In 1985, Montgomery proposed an algorithm [32] to remove the costly division operation from the modular reduction. Let X = xR mod M and Y = yR mod M be the Montgomeritized forms of x and y modulo M, where R and M are coprime and R > M. Montgomery multiplication defines the multiplication between two Montgomeritized-form numbers, MonMul(X, Y) = X × Y × R^(−1) mod M. Even though the algorithm works for any R which is relatively prime to M, it is more useful when R is taken to be a power of 2, which leads to a fast division by R.
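A compact Python sketch of the textbook Montgomery multiplication with R a power of 2 (illustrative only; the paper's variant is the limb-parallel relaxation of Algorithm 1):

```python
def monmul(x_bar: int, y_bar: int, M: int, r: int) -> int:
    """Return x_bar * y_bar * R^{-1} mod M, where R = 2^r and gcd(M, 2) = 1."""
    R = 1 << r
    M_prime = (-pow(M, -1, R)) % R    # -M^{-1} mod R, precomputed once per key
    T = x_bar * y_bar
    m = (T * M_prime) % R             # quotient determination: cheap mod 2^r
    t = (T + m * M) >> r              # T + m*M is divisible by R: exact shift
    return t - M if t >= M else t     # single conditional subtraction

M, r = 97, 8                          # toy modulus, R = 256 > M
R = 1 << r
x, y = 40, 7
x_bar, y_bar = (x * R) % M, (y * R) % M   # Montgomeritized forms
assert monmul(x_bar, y_bar, M, r) == (x * y * R) % M   # result is (x*y)R mod M
```

The point of the construction is visible in the code: the only "divisions" are a mask (`% R`) and a shift (`>> r`), both trivial when R is a power of 2.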
A series of improvements to the Montgomery multiplication have been published since its introduction. In 1995, Orup [33] economized the determination of quotients by loosening the restriction on the input and output ranges. Algorithm 1 shows the detailed steps.

4. DPF-Based Montgomery Multiplication
We aim to implement n-bit RSA decryption with the CRT technique introduced in Section 3.3; thus (n/2)-bit Montgomery multiplication shall be implemented first. This section proposes a DPF-based Montgomery multiplication parallel calculation scheme for CUDA GPUs, including the large integer representation, the fundamental operations, and the parallelism method of the Montgomery multiplication.
4.1. Advantages and Challenges of DPF-Based RSA
The DPF instruction set has a huge advantage in asymmetric cryptographic algorithm acceleration, as demonstrated in Table 2. Since RSA builds on large integer multiplication, such a huge advantage becomes a decisive point in competing with the integer instruction set.
However, many problems are encountered when exploiting its theoretically superior multiplication computing power.
(i) Round-Off Problem. Due to the round-off problem, every detail should be taken into careful consideration, which makes it very difficult and complicated to design and implement the algorithm.
(ii) Nonsupport for Carry Flag. The lack of support for a carry flag makes it very inconvenient and inefficient to perform multi-precision add or multiply-add operations. Instead of using only one carry-flag-supported integer instruction, the carry has to be handled manually via multiple add or multiply-add instructions.
(iii) Inefficient Add Instruction. From Table 2, it can be found that, unlike the multiplication instruction, the DPF add instruction is slower than the integer add instruction. Moreover, the nonsupport for a carry flag makes it perform even worse in multi-precision add operations.
(iv) Inefficient Bitwise Operations. Floating-point arithmetic does not support the bitwise operations which are frequently used. The CUDA Math API does support the fmod function [34], which can be employed to extract the least and most significant bits, but it consumes tens of instructions, which is extremely inefficient, whereas with the native integer instruction set a bitwise operation needs only one instruction.
(v) Extra Memory and Register File Cost. A DPF occupies 64-bit memory space; however, only 26 or fewer bits are used. In this way, extra cost in memory access and utilization of register files has to be paid. In integer-based implementations, this issue does not arise since every bit of an integer is utilized.
Before the introduction to the proposed algorithm, several symbols are defined in Table 3 for better reading of the following sections.

4.2. DPFBased Representation of Large Integer
In Montgomery multiplication, the multiply-accumulation operation is frequently used. In CUDA, the fused multiply-add (fma) instruction is provided to perform the floating-point multiply-add operation.
When each DPF input (the multiplicand a and multiplier b) contains ω (ω ≤ 26) significant bits, 2^(53−2ω) executions of r = fma(a, b, r) (the initial value of r being zero) can be performed free from the round-off problem.
Note that, in Algorithm 1, there are s loops and each loop contains 2 fma operations for each limb, where s = ⌈n/ω⌉ and n stands for the bit length of the modulus. Thus 2s fma operations are needed in total for each limb. The following requirement should be met so that the number of supported fma operations surpasses the required ones:

2·⌈n/ω⌉ ≤ 2^(53−2ω). (3)

A lower ω means more instructions are required to process the whole algorithm. From Formula (3), to achieve the best performance, ω is chosen as shown in Table 4.
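The trade-off can be explored numerically. The sketch below assumes the constraint has the form 2·⌈n/ω⌉ ≤ 2^(53−2ω) (our reading of the requirement; the paper's exact accounting may differ) and picks the largest ω, hence the fewest limbs, that still avoids round-off:

```python
import math

def best_omega(n_bits: int) -> int:
    """Largest limb width omega satisfying 2*ceil(n/omega) <= 2^(53-2*omega)."""
    best = 0
    for omega in range(1, 27):          # 2*omega must stay below 53 bits
        if 2 * math.ceil(n_bits / omega) <= 2 ** (53 - 2 * omega):
            best = omega                # larger omega => fewer limbs/instructions
    return best

# 1024-bit operands (RSA-2048 with CRT) admit 23-bit limbs under this
# assumed constraint, matching the 23-bit Simplified-format limbs that
# Section 5.3 uses for 2048-bit integers.
assert best_omega(1024) == 23
```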

In this contribution, two kinds of DPF-based large integer representations are proposed, the Simplified format and the Redundant format.
(i) The Simplified format is a vector of DPF limbs in which each limb contains at most ω significant bits. It is applied to represent the inputs of the fma instruction.
(ii) The Redundant format is a vector of DPF limbs in which each limb contains at most 53 significant bits. It is applied to accommodate the output of the fma instruction.
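The two representations can be sketched in Python (an illustration with ω = 23, one of the limb sizes used later in the paper; the helper names are ours):

```python
W = 23                                    # limb width omega

def to_simplified(x: int, n_limbs: int) -> list:
    """Pack an integer into Simplified-format limbs (<= W bits per double)."""
    return [float((x >> (W * i)) & ((1 << W) - 1)) for i in range(n_limbs)]

def from_limbs(ds: list) -> int:
    """Recover the integer; also valid for Redundant limbs (up to 53 bits)."""
    return sum(int(d) << (W * i) for i, d in enumerate(ds))

x = 0x1234_5678_9ABC_DEF0_1122            # ~77-bit value, fits in 5 limbs
ds = to_simplified(x, 5)
assert all(d < 2.0**W for d in ds)        # Simplified: every limb <= W bits
assert from_limbs(ds) == x                # lossless round trip
```

The Redundant format differs only in the per-limb bound: a limb may hold up to 53 bits, so adjacent limbs' value ranges overlap, which is exactly what lets fma results accumulate without immediate carry propagation.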
4.3. Fundamental Operation and Corresponding Optimization
In DPF-based Montgomery multiplication, the fundamental operations include multiplication, multiply-add, addition, and bit extraction.
(i) Multiplication. In the implementation, the native multiplication instruction (mul) is used to perform multiplication. It is required that both multiplicand and multiplier are in the Simplified format to avoid the round-off problem.
(ii) Multiply-Add. In CUDA GPUs, the fma instruction is provided to perform the floating-point multiply-add, which is executed in one step with a single rounding. In the implementation, when using the fma instruction, it is required that the multiplicand and multiplier are both in the Simplified format and the addend is in the Redundant format but small enough to avoid round-off.
(iii) Bit Extraction. The algorithm needs to extract the most or least significant bits from a DPF. However, as introduced in Section 4.1, bitwise operations on DPF are inefficient. Two attempts at improvement were made to promote the performance. The first one is introduced in [35]: using round-to-zero arithmetic, the most significant bits and the least significant bits can be extracted from a DPF, the least significant bits being left out by the round-off operation in the intermediate step. The second one is converting the DPF to integer and then using the CUDA native 32-bit integer bitwise instructions to handle bit extraction. Through the experiments, it is found that the second method always gives better performance. Therefore, in most cases, the second method is used to handle bit extraction. There is only one exception: when the DPF is divisible by a power of 2, a DPF division is used to extract the most significant bits, which can be executed very fast.
(iv) Addition. CUDA GPUs provide the native add instruction to perform addition between two DPFs. But, as aforementioned, it is inefficient and does not support a carry flag. Thus, the DPFs are first converted into integers; then the CUDA native integer add instruction is used to handle the addition.
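Both bit-extraction routes can be demonstrated on a 53-bit value held in a double (a Python sketch; `math.fmod` plays the role of the CUDA fmod function, and the split point W = 23 is illustrative):

```python
import math

W = 23
c = float((0x1ABCD << W) | 0x7FFFF)       # 40-bit value: high 0x1ABCD, low 0x7FFFF

# Route 1: floating-point only. fmod yields the low W bits; the remaining
# part is then exactly divisible by 2^W, so the division is round-off-free.
low  = math.fmod(c, 2.0**W)
high = (c - low) / 2.0**W
assert int(low) == 0x7FFFF and int(high) == 0x1ABCD

# Route 2 (preferred in the paper): convert to integer, use bitwise ops.
ci = int(c)
assert ci & ((1 << W) - 1) == 0x7FFFF
assert ci >> W == 0x1ABCD
```

Route 1's second line also shows the special case mentioned above: once a DPF is divisible by a power of 2, a single floating-point division extracts the most significant bits exactly.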
4.4. DPFBased Montgomery Multiplication Algorithm
With reference to Algorithm 1, in the Montgomery multiplication, the multiplicand, multiplier, modulus, and result are all (n/2)-bit integers (in fact, the intermediate result is slightly longer, but it is also represented in the same common format). Choosing ω as the limb size, ⌈(n/2)/ω⌉ DPF limbs are required to represent each of them.
In previous works [19, 21], the Montgomery multiplication is parallelized by single limbs; that is, each thread deals with one limb (a 32-bit or 64-bit integer). The one-limb parallelism causes a large cost in thread synchronization and communication, which greatly decreases the overall throughput. To maximize the throughput, Neves and Araujo [22] performed one entire Montgomery multiplication in one thread to economize the overhead of thread synchronization and communication, however resulting in a high latency, about 150 ms for 1024-bit RSA, which is about 40 times that of [19].
To make a trade-off between throughput and latency, in the implementation, we try multiple-limb parallelism, namely, using several threads to compute one Montgomery multiplication, with each thread dealing with several limbs. The degree of parallelism can be flexibly configured to offer the maximal throughput with acceptable latency. Additionally, we restrict the threads of one Montgomery multiplication to within a warp, in which threads are naturally synchronized, free from the overhead of thread synchronization, and the shuffle instruction can be used to share data between threads. To fully occupy the thread resources, the number of threads per Montgomery multiplication shall be a divisor of the warp size (i.e., 32).
In the proposed Montgomery multiplication, the inputs are in the Simplified format and the initial value of the accumulator is 0. Two phases are employed to handle one Montgomery multiplication, the Computing Phase and the Converting Phase. In the Computing Phase, Algorithm 2 is responsible for the calculation, whose result is represented in the Redundant format. Then, in the Converting Phase, the result is converted from the Redundant format to the Simplified format applying Algorithm 3.


4.4.1. Computing Phase
Algorithm 2 is a limb-parallelized version of Algorithm 1. In the algorithm, both the inputs and the output are divided equally into groups, and each group is assigned to one single thread; in each thread, the variables with the corresponding indices are processed. Note that if the index of a certain variable exceeds the limb count, the variable shall be padded with zero. In this section, lowercase variables indicate the private registers stored in a thread and uppercase ones represent the global variables; for example, a lowercase limb in a thread represents its share of the corresponding uppercase variable. The notation b = shuffle(a, i) means that the current thread obtains variable a of Thread i and stores it into variable b using the shuffle instruction.
With reference to Algorithm 1, we introduce the proposed parallel Montgomery multiplication algorithm step by step.
(1) Z ← Z + x_i · Y: in this step, each thread calculates its share of Z + x_i · Y. Note that each thread stores a group of limbs of X. Thus, for each Loop i, the corresponding x_i shall first be broadcast from the thread holding it to all threads; the shuffle instruction is used to conduct this broadcast operation. Then each thread executes z_j = fma(x_i, y_j, z_j) for each of its limbs.
(2) q ← z_0 · M′ mod 2^ω: q is only related to Z_0, which is stored in Thread 0 as z_0. Therefore, in this step, the calculation is conducted only in Thread 0, while the other threads are idle. Note that z_0 is in the Redundant format, and its least significant ω bits shall first be extracted before the multiplication by M′. In the next step, q will act as a multiplier; hence the same bit extraction shall also be applied to q.
(3) Z ← Z + q · M: in this step, each thread calculates its share of Z + q · M. Because q is stored only in Thread 0, similar to Step (1), it should also be broadcast to all threads. Then each thread executes z_j = fma(q, m_j, z_j) for each of its limbs.
(4) Z ← Z/2^ω: in this step, each thread conducts a division by 2^ω via a shift-right operation. In the view of the whole algorithm, the shift right can be done by moving every limb one position lower and padding the most significant limb with zero. From the view of a single thread, there are 2 noteworthy points. The first point is that the limb a thread needs after the shift is stored in the neighbouring thread; thus each thread needs to propagate its lowest stored limb to the thread below, again via the shuffle instruction. The second point is that Z is represented in the Redundant format; when executing the shift, the upper bits of the least significant limb need to be preserved, so Thread 0 performs an extra accumulation that the other threads do not. Because, due to the shift-right operation, the accumulated quantity in each loop is not the same, the accumulation can only be kept within a bound that does not cause the round-off problem. Note that the value concerned is divisible by 2^ω; as introduced in Section 4.3, a simple division can be used to efficiently extract its most significant bits.
After the Computing Phase, the result is in the Redundant format. Next, the Converting Phase is used to convert it into the Simplified format.
4.4.2. Converting Phase
In the Converting Phase, the result is converted from the Redundant format to the Simplified format: every limb adds the carry propagated from the lower limb (the least significant limb does not execute this addition), holds the least significant ω bits of the sum, and propagates the most significant bits as the new carry to the next limb. However, this method is serial, as the calculation of every limb depends on the carry that the lower limb produces, which does not comply with the parallel architecture of the GPU. In practice, a parallelized method is applied to accomplish the Converting Phase, which is shown in Algorithm 3.
Algorithm 3 splits each 53-bit integer limb into its most significant bits and its least significant ω bits. Firstly, all threads execute a chain addition for their own limbs and store the final carry. Then every thread (except the last one) propagates the stored carry to the next thread using the shuffle instruction and then repeats the chain addition with the propagated carry. This step continues until the carry of every thread is zero, which can be checked by the CUDA __any() voting instruction [4]. The number of iterations equals the number of threads in the worst case, but in most cases it takes one or two. Compared with the serial method, over 75% of the execution time is saved in the Converting Phase using the parallelized method.
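The iterate-until-no-carry idea can be sketched serially in Python (illustrative only: here each list element stands for one limb, the carry list models the warp-wide shuffle, and the loop condition models the __any() vote; the sketch assumes the final value fits in the given number of limbs):

```python
W = 23
MASK = (1 << W) - 1

def converting_phase(redundant: list) -> list:
    """Redundant limbs (up to 53 bits each) -> Simplified limbs (<= W bits)."""
    limbs = [int(d) for d in redundant]
    while any(d > MASK for d in limbs):            # models the __any() vote
        carries = [0] + [d >> W for d in limbs[:-1]]   # carries shuffled upward
        limbs = [(d & MASK) + c for d, c in zip(limbs, carries)]
    return [float(d) for d in limbs]

out = converting_phase([2.0**50, 2.0**40, 123.0, 0.0, 0.0])
assert all(d < 2.0**W for d in out)                # now in Simplified format
# The represented value is unchanged by the conversion.
assert sum(int(d) << (W * i) for i, d in enumerate(out)) == \
       (1 << 50) + ((1 << 40) << W) + (123 << (2 * W))
```

Note how this example already terminates after two rounds rather than the worst-case limb count, mirroring the "one or two iterations in most cases" observation above.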
After Converting Phase, is in Simplified format. An entire Montgomery multiplication is completed.
5. RSA Implementation
This section introduces the techniques in the implementation of the Montgomery exponentiation, the CRT computation, and the pre-/post-computation format conversion.
5.1. Montgomery Exponentiation
As introduced in Section 3.3, RSA with the CRT technique requires two (n/2)-bit Montgomery exponentiations to accomplish n-bit RSA decryption. With the binary square-and-multiply method, the expected number of modular multiplications is 1.5n for an n-bit modular exponentiation. The number can be reduced with the 2^k-ary method given by [36], which scans multiple bits, instead of one bit, of the exponent. Jang et al. [19] used the sliding-window technique Constant Length Nonzero Windows (CLNW) [37] to further reduce the number of Montgomery multiplications. But it is not suitable for our design, in which an entire warp may contain more than one Montgomery exponentiation, and each exponentiation has a different exponent, leading to different scanning step sizes and execution logic. These differences would cause warp divergence, largely decreasing the overall performance. Moreover, the 2^k-ary method is timing-attack-proof, as its calculation time is not affected by the bit pattern of the exponent. For all the above reasons, we choose to employ the 2^k-ary method rather than the CLNW method.
In the 2^k-ary method, the window size k shall be chosen properly to reduce both the number of modular multiplications and the memory accesses; the window size k = 5 is employed as it offers the least computational cost and memory access. In this contribution, the 2^k-ary method is implemented and reduces the number of modular multiplications from 1536 to 1259 for 1024-bit modular exponentiation, achieving a 17.9% improvement. In the 2^k-ary method, a precompute table of 2^k − 2 entries needs to be stored into memory for each Montgomery exponentiation. The memory space required (about 512 KB) is far more than the size of the shared memory (at most 48 KB); thus the tables have to be stored in global memory. Global memory load and store operations consume hundreds of clock cycles. To improve memory access efficiency, the Simplified-format precompute tables are converted into 32-bit-integer format using the GPU format-convert instruction and then stored in global memory. This optimization saves about 50% of the memory access consumption.
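A rough operation-count model makes the window-size choice concrete (a sketch under our own accounting, not the paper's exact formula: 2^k − 2 precompute multiplications, n squarings, and one multiplication per k-bit window of an n-bit exponent):

```python
import math

def mults_2k_ary(n: int, k: int) -> int:
    """Approximate modular-multiplication count for 2^k-ary exponentiation."""
    precompute = 2**k - 2              # table entries x^2 .. x^(2^k - 1)
    squarings = n                      # k squarings per window, n total
    window_mults = math.ceil(n / k)    # one multiplication per window
    return precompute + squarings + window_mults

# For a 1024-bit exponent, k = 5 reproduces the figure quoted above, and
# windowing clearly beats scanning one bit at a time.
assert mults_2k_ary(1024, 5) == 1259
assert mults_2k_ary(1024, 5) < mults_2k_ary(1024, 1)
```

Larger k shaves a few more multiplications but doubles the precompute table with every increment, which is why memory access enters the choice alongside pure operation count.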
In the Montgomery exponentiation, the exponent is represented in integer format. Similar to the processing of shared data in the Montgomery multiplication, each thread stores a piece of the exponent and uses shuffle to broadcast it from a certain thread to all threads.
Algorithm 4 shows how to conduct the Montgomery exponentiation, where MonMul means calculating a Montgomery multiplication using Algorithms 2 and 3.

5.2. CRT Computation
In the first implementation, the GPU only took charge of the Montgomery exponentiation, and the CRT computation (2) was offloaded to the CPU using the GNU Multiple Precision (GMP) arithmetic library [38]. But we found that the low efficiency of the CPU computing power greatly limits the performance of the entire algorithm, occupying about 15% of the execution time. Thus we made the attempt to integrate the CRT computation into the GPU.
For the CRT computation, a modular subtraction and a multiply-add function are additionally implemented. Both functions are parallelized over the threads which take charge of the Montgomery exponentiation. As a result, the CRT computation occupies only about 1% of the execution time and offers independence from the CPU computing capability. The scheme is shown in Figure 1.
5.3. Pre- and Post-computation Format Conversion
The RSA algorithm is built on large integer arithmetic, whose input and output are both large integers. The Digital Signature Standard (FIPS PUB 186-4) [39] by the National Institute of Standards and Technology (NIST) regulates how to conduct the conversion between bit strings and integers. Unfortunately, our DPF-based large integer representation is not consistent with it. Thus, before and after applying the proposed DPF-based RSA algorithm, the conversion between DPF and 32-bit integer shall be conducted.
The first attempt was to conduct the conversion in the CPU, which is easy to implement, as shown in Figure 2(a): before and after the GPU kernel, the DPF-format input and output are converted from or to bit strings using bitwise and format-conversion instructions of the CPU. However, this strategy has two significant drawbacks.
Firstly, coded in DPF, the input and output require much more data transfer (cudaMemcpy) between GPU and CPU. Specifically, since each Simplified-format DPF carries only a 23-bit significand, ⌈2048/23⌉ = 90 DPFs are needed to store a 2048-bit integer, which occupies 90 × 64 = 5,760 bits and causes (5,760 − 2,048)/2,048 ≈ 181% extra data-transfer overhead. In fact, the overhead is quite high: as introduced in Section 3.3, 2048-bit RSA decryption with the CRT technique inflates both the input and the output by this factor. Assume 896 2048-bit DPF-based RSA decryptions are computed at once; at a rate of 6 GB/s for data transfer through PCIe between GPU and CPU, the cudaMemcpy alone requires over 1 ms, while the total duration of the RSA-2048 decryption batch is only about 20 ms.
Secondly, converting the data format in the CPU is inefficient. The simplest way is to convert serially before and after the GPU kernel in the CPU. In the experiment, for example, completing 896 RSA-2048 conversions takes over 1.1 ms on a single core of the CPU (Intel E5-2697 v2). A GPU, by contrast, is adept at such conversions with thousands of threads and can accomplish the same amount of conversion in about 0.1 ms on GTX TITAN.
With full consideration of the above two drawbacks, we instead transfer data between GPU and CPU in integer format rather than DPF, and the GPU kernel threads assigned to each RSA decryption perform the data conversion before and after the decryption calculation itself, as shown in Figure 2(b).
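The data-expansion arithmetic behind the first drawback can be reproduced in a few lines (limb and word widths as stated above; the helper name is ours):

```python
# Data expansion when transferring a big integer as Simplified-format DPFs:
# each DPF carries a 23-bit significand but occupies 64 bits on the wire.
import math

def dpf_transfer_overhead(n_bits, limb_bits=23, dpf_bits=64):
    limbs = math.ceil(n_bits / limb_bits)   # DPFs needed for an n-bit integer
    coded = limbs * dpf_bits                # bits actually transferred
    return limbs, coded, (coded - n_bits) / n_bits

limbs, coded, overhead = dpf_transfer_overhead(2048)
# 90 limbs, 5,760 bits coded, ~181% extra transfer
```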
Pre-Computation Conversion. First, the CPU transfers the 32-bit-integer-format inputs to the global memory of the GPU. According to the DPFs it needs to process in Algorithms 2 and 3, each CUDA thread extracts the corresponding bits from the 32-bit integers, reassembles them into integer limbs using bitwise instructions, and finally converts each limb into a Simplified-format DPF using the integer-to-DPF instruction, yielding the input of the proposed algorithm.
Post-Computation Conversion. The post-computation conversion is the reverse procedure: the Simplified-format outputs are converted into 32-bit-integer format and then returned to the CPU.
This method largely reduces the cost of data transfer and better leverages the computing power of the GPU. For 2048-bit RSA decryption, the overall latency decreases by up to 1.2 ms (about 5%).
6. Performance Evaluation
This section presents the implementation performance and summarizes the results for the proposed algorithm. A relative assessment against related implementations is also presented. The hardware and software configurations used in the experiment are listed in Table 5.

6.1. Experiment Result
Applying the DPF-based Montgomery multiplication algorithm and the RSA implementation described in Sections 4 and 5, respectively, RSA-2048/3072/4096 decryptions are implemented on an NVIDIA GTX TITAN.
Several configuration parameters may affect the performance of the kernel, including the following:
(i) Batch Size: the number of RSA decryptions per GPU kernel launch.
(ii) Threads/Block: the number of CUDA threads contained in a CUDA block.
(iii) Threads/RSA: the number of CUDA threads assigned to each RSA decryption.
(iv) Regs/Thread: the maximum number of registers assigned to each CUDA thread; both Regs/Thread and Threads/Block × Regs/Thread must stay within the GPU hardware limits (255 32-bit registers per thread and 65,536 registers per CUDA block on GTX TITAN).
Table 6 demonstrates the performance for RSA-2048/3072/4096 with a set of combinations of all the above configurations.
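The register constraint in item (iv) is a simple feasibility check; a hypothetical helper (the limits are the GTX TITAN figures quoted above, the function itself is not from the paper) makes it concrete:

```python
# Feasibility of a (Threads/Block, Regs/Thread) configuration under the
# GK110 limits quoted in the text: 255 registers per thread and 65,536
# registers per CUDA block.

def config_fits(threads_per_block, regs_per_thread,
                max_regs_thread=255, max_regs_block=65536):
    return (regs_per_thread <= max_regs_thread and
            threads_per_block * regs_per_thread <= max_regs_block)

# 256 threads/block at 127 regs/thread needs 32,512 registers: fits;
# 512 threads/block at 255 regs/thread needs 130,560: does not.
```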

Note that the configuration Batch Size is varied from a small number up to the hardware limit. Figure 3 summarizes the impact of Batch Size on performance and indicates that a larger Batch Size always leads to better performance. This is easy to understand: GPUs pipeline instructions to leverage instruction-level parallelism [4], and configuring as large a Batch Size as possible keeps the pipeline busy most of the time and fully utilizes the computing resources of the GPU, thereby yielding better performance. As throughput is the primary concern of this contribution, Table 6 uses the maximum Batch Size supported by the GPU hardware for assessment (half of the maximum Batch Size is also provided for comparison), although some lower Batch Size configurations bring a lower latency with an almost equivalent throughput.
Another important factor in performance is Threads/RSA. In Table 6, the column Threads/RSA gives the number of threads assigned to one RSA decryption; under CRT, half of them process each Montgomery multiplication. As discussed in Section 4.4, theoretically, parallelizing a single computing task across too many threads results in poor throughput, while parallelism over too few threads leads to high latency. Toward practical usage, and emphasizing high throughput, the experiment applies as few Threads/RSA as possible on the premise of an acceptable latency.
However, even without considering latency, at some points the hardware limit on the register file per thread means that parallelism over too few threads may not promote but instead largely decrease the overall throughput. An example from the experiment is RSA-4096 with the smaller Threads/RSA setting shown in Table 6. For RSA-2048, the throughput of the smaller Threads/RSA setting always precedes that of the larger one. For RSA-4096 with the smaller setting, however, the growing number of required variables tremendously exceeds the hardware limit (127 registers per thread on GTX TITAN at the maximum Batch Size), so many variables spill into off-chip local memory, which is hundreds of times slower than registers; the throughput is consequently much lower than with the larger setting. In fact, the proposed DPF-based algorithm is especially sensitive to the number of available registers, since it consumes more register file than a normal integer-based one, as specified in Section 4.1.
To summarize, for best practice of the proposed DPF-based algorithm, many factors should be considered together to suit both the GPU hardware and the RSA modulus size, especially Threads/RSA and Regs/Thread, whose requirements grow with the modulus size (7,680 bits and more).
6.2. Performance Comparison
This section provides two groups of comparisons, depending on the implementation mechanism: floating-point-based and integer-based.
6.2.1. Proposed versus Floating-Point-Based Implementation
Bernstein et al. [5] employed floating-point arithmetic to implement 280-bit Montgomery multiplication. Table 7 shows the performance of [5] and ours. Note that [5] implemented only the 280-bit modular multiplication; its performance is therefore scaled to that of a 1024-bit one, and further scaled by the ratio of floating-point processing power of the corresponding GPUs. The scaled result is shown in the row "1024-bit MulMod (scaled)."

Table 7 demonstrates that the resulting implementation achieves a 13-times speedup over [5]. Part of the performance promotion results from the advantage DPF holds over SPF, as discussed in Section 3.2; they did not utilize DPF because the GTX 295 they used does not support DPF instructions. The second reason comes from the processing of Montgomery multiplication: Bernstein et al. used the Separated Operand Scanning (SOS) Montgomery multiplication method, which is known to be inefficient [20], and they utilized only 28 threads of a warp (consisting of 32 threads), wasting 1/8 of the processing power. The third reason is that we use the latest CUDA shuffle instruction to share data between threads, while [5] used shared memory; as introduced in Section 3.1, the shuffle instruction gives better performance than shared memory. The last reason is that Bernstein et al. [5] used floating-point arithmetic for all operations, some of which, such as bit extraction, are more efficient with integer instructions; by contrast, we flexibly employ integer instructions to accomplish these operations.
6.2.2. Proposed versus Integer-Based Implementation
Previous works [7, 19, 21–23] are all integer-based, so we scale their CUDA-platform performance based on integer processing power. The parameters in Table 8 originate from [4, 40], but the integer processing power is not given directly. Taking the SM count, the processing power of each SM, and the Shader Clock into consideration, we calculate the integer-multiplication instruction throughput, Int Mul. Among the platforms, 8800 GTS and GTX 260 support only a 24-bit multiply instruction, while the others support a 32-bit multiply instruction; we therefore adjust their integer-multiply processing capability by a correction parameter (unadjusted data in parentheses). The Throughput Scaling Factor is defined as the Int Mul ratio between the corresponding CUDA platform and GTX TITAN, and the Latency Scaling Factor is defined analogously as the Shader Clock ratio.
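The scaling just described is plain ratio arithmetic; a short sketch makes the direction of the normalization explicit (the numeric inputs here are placeholders, not the actual Table 8 values):

```python
# Cross-platform normalization used for Table 8: compute an Int Mul
# figure per platform, then divide each reported throughput by that
# platform's ratio to GTX TITAN.

def int_mul_throughput(sm_count, mul_per_sm_per_clock, shader_clock_ghz):
    # integer-multiply instruction throughput, in Gops/s
    return sm_count * mul_per_sm_per_clock * shader_clock_ghz

def scale_to_titan(raw_throughput, platform_int_mul, titan_int_mul):
    # Throughput Scaling Factor = platform Int Mul / TITAN Int Mul;
    # scaled performance = raw / factor (weaker platforms scale upward).
    return raw_throughput / (platform_int_mul / titan_int_mul)
```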
 
Notes to Table 8: one prior work also reports an RSA-2048 decryption latency of 6.5 ms (6.8 ms after scaling) with a Batch Size of 1, at which point the throughput is 154. Our peak 2048-bit RSA throughput is reached when the window size is 6, Max Reg. is 127, and Batch Size is at its maximum; the peak 4096-bit RSA throughput is reached under the corresponding RSA-4096 configuration.
Table 8 summarizes the resulting performance of each work. We divide each result by the corresponding Scaling Factor listed in Table 8 to obtain the scaled performance. Note that the RSA key length of Neves and Araujo [22] is 1024 bits while ours is 2048 bits, so their result is multiplied by an additional factor (1/4 for the performance of modular multiplication, whose cost grows quadratically with the operand size, and 1/2 for the half-length exponent).
From Table 8, at the modular-multiplication level, the proposed floating-point-based implementation outperforms the others by a great margin. We achieve nearly 6 times the speed of [23], and even on the same CUDA platform we obtain 221% of the performance of [21].
At the RSA-implementation level, the proposed implementation also shows a great performance advantage: 291% of the performance of [22] for RSA-2048. Note that the RSA-1024 decryption of [22] has a latency of about 150 ms, while our 2048-bit RSA decryption reaches 21.52 ms at its throughput peak. Yang [24] reported a latency of 195.27 ms (scaled, 225.60 ms) for an RSA-2048 throughput of 5,244 (scaled, 31,782); the throughput of the proposed RSA-2048 implementation is 132% of Yang's, while its latency is only 9.4% of theirs [24]. Our peak RSA-2048 throughput is slightly lower than that of the fastest integer-based implementation [7], while our latency is 3 times lower. For RSA-4096 decryption, Emmart and Weems [7] use a distributed model based on the CIOS method with distributed integer values and report a throughput of 5,257 RSA-4096 decryptions on a GTX 780 Ti; scaled to a GTX TITAN, this is 4,693, and our work is 1.23 times faster.
The performance advantage lies mainly in the utilization of floating-point processing power and the careful handling of Montgomery multiplication, which overcomes the problems addressed in Section 4.1. The DPF-based representation of large integers and the Montgomery multiplication are carefully scheduled to avoid round-off problems and to largely decrease the number of expensive carry-flag handling steps. For the inefficient add instruction and bitwise operations of DPF, the integer instruction set is flexibly employed to cover the deficiency. Moreover, the precomputed-table storage optimization (Section 5.1) and the design of the pre- and post-computation format conversion (Section 5.3) mitigate the performance impact of the extra memory and register-file cost.
Besides, compared with the works that use multiple threads to process a Montgomery multiplication [19, 21], another vital reason is that we use the more efficient shuffle instruction to handle data sharing instead of shared memory and employ a more reasonable degree of thread parallelism, reducing the overhead of inter-thread communication.
6.3. Discussion
At the 2016 GPU Technology Conference, NVIDIA released its latest Tesla P100 accelerator based on the Pascal GP100, which brings a huge improvement in floating-point computing power, especially for DPF. Tesla P100 delivers 5.3 TFLOPS of DPF performance, 3.4 times that of our target platform [41]. Meanwhile, the number of registers per CUDA core is doubled [41], which is essential for a DPF-based RSA implementation. Based on these improvements, the performance of the proposed DPF-based algorithm should improve by at least three times, further widening the gap between DPF-based and traditional integer-based algorithms.
7. Conclusion
In this contribution, we propose a brand-new approach to implementing high-performance RSA cryptosystems on the latest CUDA GPUs by utilizing their powerful floating-point computing resources. Our results demonstrate that the floating-point computing resource is a highly competitive candidate for asymmetric-cryptography implementation on CUDA GPUs. On an NVIDIA GeForce GTX TITAN, our RSA-2048 decryption reaches a throughput of 42,211 operations per second, 13 times the performance of the previous floating-point-based implementation. For RSA-3072/4096 decryption, our peak throughputs are 12,151 and 5,790, and the RSA-4096 implementation is 1.23 times faster than the best integer-based result. We hope our endeavor sheds light on future research and inspires more case studies using the floating-point computing power of GPUs, for RSA as well as other asymmetric cryptography. We will apply these designs to related arithmetic, such as a floating-point-based ECC implementation, and exploit the floating-point power of Tesla P100.
Disclosure
A preliminary version of this paper appeared under the title "Exploiting the Floating-Point Computing Power of GPUs for RSA" in Proc. Information Security, 17th International Conference, ISC 2014, Hong Kong, October 12–14, 2014 [42] (Best Student Paper Award). Dr. Yuan Zhao participated in this work while studying at the Chinese Academy of Sciences and now works at Huawei Technologies, Beijing, China.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was partially supported by the National 973 Program of China under Award no. 2014CB340603 and the Strategic Priority Research Program of Chinese Academy of Sciences under Grant XDA06010702.
References
 N. Koblitz, “Elliptic curve cryptosystems,” Mathematics of Computation, vol. 48, no. 177, pp. 203–209, 1987. View at: Publisher Site  Google Scholar  MathSciNet
 V. S. Miller, “Use of elliptic curves in cryptography,” in Advances in Cryptology (CRYPTO '85), H. C. Williams, Ed., vol. 218 of Lecture Notes in Computer Science, pp. 417–426, Springer, 1986. View at: Publisher Site  Google Scholar  MathSciNet
 R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Communications of the Association for Computing Machinery, vol. 21, no. 2, pp. 120–126, 1978. View at: Publisher Site  Google Scholar  MathSciNet
 2013, NVIDIA: CUDA C programming guide 8.0, https://docs.nvidia.com/cuda/cudacprogrammingguide/.
 D. J. Bernstein, T.-R. Chen, C.-M. Cheng, T. Lange, and B.-Y. Yang, “ECM on Graphics Cards,” in Advances in Cryptology–EUROCRYPT, pp. 483–501, Springer, Berlin, 2009. View at: Google Scholar  MathSciNet
 D. J. Bernstein, H. C. Chen, M. S. Chen et al., “The Billion-Mulmod-Per-Second PC,” in Workshop Record of SHARCS, vol. 9, pp. 131–144, 2009. View at: Google Scholar
 N. Emmart and C. Weems, “Pushing the performance envelope of modular exponentiation across multiple generations of GPUs,” in Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS '15), pp. 166–176, May 2015. View at: Publisher Site  Google Scholar
 S. Antão, J.-C. Bajard, and L. Sousa, “Elliptic curve point multiplication on GPUs,” in Proceedings of the 21st IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '10), pp. 192–199, July 2010. View at: Publisher Site  Google Scholar
 S. Antão, J.C. Bajard, and L. Sousa, “RNSbased elliptic curve point multiplication for massive parallel architectures,” Computer Journal, vol. 55, no. 5, pp. 629–647, 2012. View at: Publisher Site  Google Scholar
 S. Pu and J.C. Liu, “EAGL: An elliptic curve arithmetic GPUbased library for bilinear pairing,” in PairingBased Cryptography–Pairing, pp. 1–19, Springer, 2014. View at: Google Scholar
 K. Leboeuf, R. Muscedere, and M. Ahmadi, “A GPU implementation of the Montgomery multiplication algorithm for elliptic curve cryptography,” in Proceedings of the 2013 IEEE International Symposium on Circuits and Systems (ISCAS '13), pp. 2593–2596, May 2013. View at: Publisher Site  Google Scholar
 W. Pan, F. Zheng, Y. Zhao, W. T. Zhu, and J. Jing, “An efficient elliptic curve cryptography signature server with GPU acceleration,” IEEE Transactions on Information Forensics and Security, pp. 111–122, 2017. View at: Google Scholar
 F. Zheng, W. Pan, J. Lin, J. Jing, and Y. Zhao, “Exploiting the Potential of GPUs for Modular Multiplication in ECC,” in Information Security Applications  15th International Workshop, pp. 295–306, Springer International Publishing, 2014. View at: Google Scholar
 J. W. Bos, “Lowlatency elliptic curve scalar multiplication,” International Journal of Parallel Programming, vol. 40, no. 5, pp. 532–550, 2012. View at: Publisher Site  Google Scholar
 R. Szerwinski and T. Güneysu, “Exploiting the power of GPUs for asymmetric cryptography,” in Cryptographic Hardware and Embedded Systems–CHES, pp. 79–99, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008. View at: Google Scholar
 J. A. Solinas, Generalized mersenne numbers. Citeseer, 1999.
 A. Moss, D. Page, and N. P. Smart, “Toward acceleration of RSA using 3D graphics hardware,” in Cryptography and Coding, vol. 4887 of Lecture Notes in Computer Science, pp. 364–383, Springer, Berlin, 2007. View at: Publisher Site  Google Scholar  MathSciNet
 O. Harrison and J. Waldron, “Efficient acceleration of asymmetric cryptography on graphics hardware,” in Progress in Cryptology–AFRICACRYPT, pp. 350–367, Springer, 2009. View at: Google Scholar
 K. Jang, S. Han, S. Moon, and K. Park, “Sslshader: cheap ssl acceleration with commodity processors,” in Proceedings of the 8th USENIX conference on Networked systems design and implementation, 2011. View at: Google Scholar
 Ç. K. Koç, T. Acar, and B. S. Kaliski Jr., “Analyzing and comparing montgomery multiplication algorithms,” IEEE Micro, vol. 16, no. 3, pp. 26–33, 1996. View at: Publisher Site  Google Scholar
 A. Jeffrey and B. D. Robinson, 2014, Fast GPU based modular multiplication. http://ondemand.gputechconf.com/gtc/2014/poster/pdf/P4156_montgomery_multiplication_CUDA_concurrent.pdf.
 S. Neves and F. Araujo, “On the performance of GPU publickey cryptography,” in Proceedings of the 22nd IEEE International Conference on ApplicationSpecific Systems, Architectures and Processors, ASAP 2011, pp. 133–140, September 2011. View at: Publisher Site  Google Scholar
 R. Henry and I. Goldberg, “Solving discrete logarithms in smoothorder groups with CUDA,” in Workshop Record of SHARCS, pp. 101–118, 2012. View at: Google Scholar
 Y. Yang, “Accelerating RSA with finegrained parallelism using GPU,” in Information Security Practice and Experience, J. Lopez and Y. Wu, Eds., vol. 9065 of Lecture Notes in Computer Science, Springer, 2015. View at: Google Scholar
 2012, NVIDIA: NVIDIA Kepler GK110 whitepaper. http://www.nvidia.com/content/PDF/kepler/NVIDIAKeplerGK110ArchitectureWhitepaper.pdf.
 2013, NVIDIA: Shuffle: tips and tricks. http://ondemand.gputechconf.com/gtc/2013/presentations/S3174KeplerShuffleTipsTricks.pdf.
 2016, NVIDIA: Parallel thread execution ISA version 5.0, http://docs.nvidia.com/cuda/parallelthreadexecution/#axzz4SFUtDRZT.
 IEEE Standards Committee, “754-2008 – IEEE Standard for Floating-Point Arithmetic,” pp. 1–58, 2008. View at: Publisher Site  Google Scholar
 J. J. Quisquater and C. Couvreur, “Fast decipherment algorithm for RSA public-key cryptosystem,” Electronics Letters, vol. 18, no. 21, pp. 905–907, 1982. View at: Google Scholar
 C. K. Koç, “High-speed RSA implementation,” Tech. Rep., RSA Laboratories, 1994. View at: Google Scholar
 J. Jonsson and B. Kaliski, Public-Key Cryptography Standards (PKCS) #1: RSA Cryptography Specifications Version 2.1, The Internet Society, 2003.
 P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985. View at: Publisher Site  Google Scholar  MathSciNet
 H. Orup, “Simplifying quotient determination in high-radix modular multiplication,” in Proceedings of the 1995 IEEE 12th Symposium on Computer Arithmetic, pp. 193–199, July 1995. View at: Google Scholar
 2016, NVIDIA: NVIDIA CUDA math API. http://docs.nvidia.com/cuda/cudamathapi/index.html#axzz308wmibga.
 D. Hankerson, S. Vanstone, and A. J. Menezes, Guide to Elliptic Curve Cryptography, Springer, New York, NY, USA, 2004. View at: MathSciNet
 D. E. Knuth, The art of computer programming: seminumerical algorithms, AddisonWesley Publishing Co., Reading, Mass, USA, 1981. View at: MathSciNet
 C. K. Koç, “Analysis of sliding window techniques for exponentiation,” Computers and Mathematics with Applications, vol. 30, no. 10, pp. 17–24, 1995. View at: Publisher Site  Google Scholar
 T. Granlund and the GMP Development Team, GNU MP: The GNU Multiple Precision Arithmetic Library, version 5.1, 2013.
 P. Gallagher and C. Kerry, FIPS PUB 186-4: Digital Signature Standard (DSS), NIST, 2013.
 2016, Wikipedia: Wikipedia:List of NVIDIA graphics processing units. http://en.wikipedia.org/wiki/Comparison_of_NVIDIA_Graphics_Processing_Units.
 2016, NVIDIA: NVIDIA Tesla P100 whitepaper, https://images.nvidia.com/content/pdf/tesla/whitepaper/pascalarchitecturewhitepaper.pdf.
 F. Zheng, W. Pan, J. Lin, J. Jing, and Y. Zhao, “Exploiting the floating-point computing power of GPUs for RSA,” in Proceedings of the Information Security – 17th International Conference (ISC '14), pp. 198–215, 2014. View at: Google Scholar
Copyright
Copyright © 2017 Jiankuo Dong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.