Abstract

Asymmetric cryptographic algorithm (e.g., RSA and Elliptic Curve Cryptography) implementations on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is to exploit the highly parallel GPU architecture and port the integer-based algorithms from general-purpose CPUs to GPUs, to offer high performance. However, the great potential cryptographic computing power of GPUs, especially that of the more powerful floating-point instructions, has in fact not been comprehensively investigated. In this paper, we fully exploit the floating-point computing power of GPUs through various designs, including a floating-point-based Montgomery multiplication/exponentiation algorithm and a Chinese Remainder Theorem (CRT) implementation in GPU. Moreover, for practical usage of the proposed algorithm, a new method is introduced to convert the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further improving the overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which achieves 13 times the performance of the previous fastest floating-point-based implementation (published in Eurocrypt 2009). The RSA-4096 decryption outperforms the existing fastest integer-based result by 23%.

1. Introduction

With the rapid development of e-commerce and cloud computing, high-density computation of digital signatures and asymmetric cryptographic algorithms such as Elliptic Curve Cryptography (ECC) [1, 2] and RSA [3] is urgently required. However, lacking significant architectural advances in recent years, CPUs find it more and more difficult to keep pace with such rapidly increasing demands. Specialized for the compute-intensive and highly parallel computations required by graphics rendering, GPUs offer much more powerful computation capability than CPUs by devoting more transistors to arithmetic processing units rather than data caching and flow control. With the advent of NVIDIA Compute Unified Device Architecture (CUDA) technology, it is now possible to perform general-purpose computation on GPUs. Due to their powerful arithmetic computing capability and moderate cost, many researchers resort to GPUs for cryptographic acceleration.

Born for high-definition 3D graphics, GPUs require high-speed floating-point processing capability; thus the floating-point units residing in GPUs have advanced dramatically. From 2010 to the present, the floating-point computing power of CUDA GPUs has grown almost 10 times, from 1,345/665.6 (Fermi architecture) Giga Floating-point Operations Per Second (GFLOPS) to 10,609/5,304 (Pascal architecture) GFLOPS for single/double-precision floating-point arithmetic. By contrast, the integer multiplication arithmetic of NVIDIA GPUs gained only a 25% performance improvement from the Fermi to the Kepler architecture; the latest Maxwell and Pascal architectures do not even provide dedicated device instructions for 32-bit integer multiply and multiply-add arithmetic [4].

However, the floating-point instruction set is inconvenient for realizing large integer modular multiplication, which is the core operation of asymmetric cryptography. More importantly, the floating-point instruction set in previous GPUs showed no significant performance advantage over the integer one. To the best of our knowledge, Bernstein et al. [5] are the first and only ones to utilize the floating-point processing power of CUDA GPUs for asymmetric cryptography. However, compared with their later work [6] based on the integer instruction set, the floating-point-based one achieves only about 1/6 of the performance. Nevertheless, with the rapid development of GPU floating-point processing power, fully utilizing the floating-point processing resource will bring great benefits to asymmetric cryptography implementations in GPUs.

Based on the above observations, in this paper, we propose a new approach to implement high-performance RSA by fully exploiting the double-precision floating-point (DPF) processing power in CUDA GPUs. In particular, we propose a DPF-based large integer representation and adapt the Montgomery multiplication algorithm to it. Also, we flexibly employ the integer instruction set to compensate for the deficiencies of the DPF instruction set. Besides, we fully utilize the latest shuffle instruction to share data between threads, which makes the core Montgomery multiplication a non-memory-access design, greatly decreasing the performance loss in thread communication. Additionally, a method is implemented to apply the proposed DPF-based RSA decryption algorithm in practical scenarios where the input and output shall be in bit format, further improving the overall performance by about 5%, by decreasing data transfer consumption using a smaller data format and leveraging GPUs to promote the efficiency of data conversion.

With these improvements, on GTX TITAN, the performance of the proposed RSA implementation increases dramatically compared with previous works. In particular, the experimental results of RSA-2048/3072/4096 decryption with CRT reach throughputs of 42,211/12,151/5,790 operations per second, achieving 13 times the performance of the previous floating-point-based implementation [5], and RSA-4096 decryption reaches 1.23 times the performance of the existing fastest integer-based one [7].

The rest of our paper is organized as follows. Section 2 introduces the related work. Section 3 presents an overview of GPU, CUDA, floating-point elementary knowledge, RSA, and Montgomery multiplication. Section 4 describes our proposed floating-point-based Montgomery multiplication algorithm in detail. Section 5 shows how to implement RSA decryption in GPUs using our proposed Montgomery multiplication. Section 6 analyses the performance of the proposed algorithm and compares it with previous work. Section 7 concludes the paper.

2. Related Work

Many previous papers demonstrate that the GPU architecture can be used as an asymmetric cryptography workhorse. For ECC, Antão et al. [8, 9] and Pu and Liu [10] employed the Residue Number System (RNS) to parallelize the modular multiplication into several threads. Bernstein et al. [6] and Leboeuf et al. [11] used one thread to handle a modular multiplication with Montgomery multiplication. Pan et al. [12], Zheng et al. [13], Bos [14], and Szerwinski and Güneysu [15] used fast reduction to implement modular multiplication over the Mersenne prime fields [16].

Unlike ECC schemes, RSA calculation requires longer and non-fixed moduli and depends on modular exponentiation. Before NVIDIA CUDA appeared on the market, Moss et al. [17] mapped RNS arithmetic to the GPU to implement a 1024-bit modular exponentiation. Later, on CUDA GPUs, Szerwinski and Güneysu [15] and Harrison and Waldron [18] developed efficient modular exponentiation with both the Coarsely Integrated Operand Scanning (CIOS) method of Montgomery multiplication and RNS arithmetic. Jang et al. [19] presented a high-performance SSL acceleration using CUDA GPUs. They parallelized the Separated Operand Scanning (SOS) method [20] of the Montgomery multiplication by single limbs. Jeffrey and Robinson [21] used similar technology to implement 256-bit, 1024-bit, and 2048-bit Montgomery multiplication on GTX TITAN. Neves and Araujo [22] and Henry and Goldberg [23] used one thread to perform a single Montgomery multiplication to economize the overhead of thread synchronization and communication; however, their implementations resulted in very high latency. Emmart and Weems [7] applied one thread to a row-oriented multiplication for 1024-bit modular exponentiation and a distributed model based on the CIOS method for 2048-bit modular exponentiation. Yang [24] used an Integrated Operand Scanning (IOS) method with one or two limbs per thread for the Montgomery multiplication algorithm to implement RSA-1024/2048 decryption.

Note that all the above works are based on the CUDA integer computing power. Bernstein et al. were the pioneers in utilizing CUDA floating-point processing power in asymmetric cryptography implementations [5]. They used 28 single-precision floating-point numbers (SPFs) to represent a 280-bit integer and implemented the field arithmetic. However, the result was barely satisfactory. Their later work [6] on the same platform, GTX 295, using the integer instruction set performed almost 6.5 times the throughput of [5].

3. Background

In this section, a brief introduction to the basic architecture of modern GPUs, floating-point arithmetic, the basic knowledge of RSA, and the Montgomery multiplication is provided.

3.1. GPU and CUDA

CUDA is a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of GPUs. It was created by NVIDIA and is implemented in the GPUs they produce [4].

The target platform, GTX TITAN, is a CUDA-compatible GPU (codename GK-110) with CUDA Compute Capability 3.5 [25], which contains 14 streaming multiprocessors (SMs). 32 threads (grouped as a warp) within one SM run concurrently in one clock cycle. Following the Single Instruction Multiple Threads (SIMT) architecture, each GPU thread runs one instance of the kernel function. A warp may be preempted when it is stalled due to memory access delay, and the scheduler may switch the runtime context to another available warp. Multiple warps of threads are usually assigned to one SM for better utilization of its pipeline; these warps are grouped into blocks. Each SM has 64 KB of on-chip memory which can be configured as fast shared memory and L1 cache; the maximum allocation of shared memory is 48 KB [25]. Each SM also possesses 64 K 32-bit registers. All SMs share 6 GB of slow global memory on a 384-bit-wide bus, cached read-only texture memory, and cached read-only constant memory [4, 25]. Each SM of GK-110 owns 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFUs), and 32 load/store units [25], yielding a throughput of 192 SPF arithmetic, 64 DPF arithmetic, 160 32-bit integer add, and 32 32-bit integer multiplication instructions per clock cycle [4, 25].

NVIDIA GPUs of Compute Capability 3.x or later bring a new method of data sharing between threads. Previously, sharing data between threads required separate store and load operations passing the data through shared memory. First introduced in the NVIDIA Kepler architecture, the shuffle instruction [4, 25] allows threads within a warp to share data directly. With the shuffle instruction, threads within a warp can read values of other threads in any imaginable permutation [25]. NVIDIA conducted various experiments [26] comparing the shuffle instruction with shared memory, which show that the shuffle instruction always gives better performance than shared memory. This instruction is available on GTX TITAN.
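To make the mechanism concrete, the following minimal CUDA sketch (ours, not from the cited experiments) broadcasts a value held by lane 0 to the whole warp with a single shuffle; __shfl is the Kepler-era intrinsic (newer toolkits spell it __shfl_sync with an explicit lane mask):

__global__ void broadcast_demo(int *out)
{
    int lane = threadIdx.x & 31;      // lane index within the warp
    int v = (lane == 0) ? 42 : 0;     // only lane 0 holds the payload
    v = __shfl(v, 0);                 // every lane reads lane 0's register
    out[threadIdx.x] = v;             // all 32 lanes now store 42
}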

3.2. Integer and Floating-Point Arithmetic in GPU

NVIDIA GPUs with CUDA Compute Capability 3.x support both integer and floating-point arithmetic instruction sets. This section introduces the instructions of most concern in asymmetric cryptographic algorithm implementations, including the add and multiply(-add) operations.

Integer. The integer arithmetic instructions add.cc, addc, mad.cc, and madc are provided to perform multiprecision add and multiply-add operations; they reference an implicitly specified condition code register (CC) having a single carry flag bit (CC.CF) holding the carry-in/carry-out [27]. With this native support for multiprecision arithmetic, most of the previous works [6, 8–11, 13–15, 18, 19, 21–23] chose integer instructions to implement the large integer arithmetic in asymmetric cryptographic algorithms.
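As an illustration, a 64-bit addition built from two 32-bit limbs can be written with these PTX instructions via inline assembly; this is a minimal sketch of the carry-chain idiom, not code from the cited works:

__device__ void add64(unsigned r[2], const unsigned a[2], const unsigned b[2])
{
    asm("add.cc.u32 %0, %2, %4;\n\t"  // low limbs; sets CC.CF
        "addc.u32   %1, %3, %5;"      // high limbs; consumes CC.CF
        : "=r"(r[0]), "=r"(r[1])
        : "r"(a[0]), "r"(a[1]), "r"(b[0]), "r"(b[1]));
}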

One noteworthy point is that the multiply and multiply-add instructions of NVIDIA GPUs with Compute Capability 3.x have a unique feature: when calculating a 32-bit multiplication, the whole 64-bit product cannot be obtained with a single instruction but requires two independent instructions (one for the lower 32-bit half and the other for the upper 32-bit half). Although a whole-multiplication "instruction" (mul.wide) is provided, which is used in [22, 23], it is a virtual instruction rather than a native one and is broken into 2 native instructions (mul.lo and mul.hi) when running on the GPUs.
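In CUDA C, the two native halves are exposed as the ordinary * operator (mul.lo) and the __umulhi() intrinsic (mul.hi); the following sketch assembles the full 64-bit product exactly as mul.wide would:

__device__ unsigned long long mul32x32(unsigned a, unsigned b)
{
    unsigned lo = a * b;            // mul.lo.u32: lower 32 bits of the product
    unsigned hi = __umulhi(a, b);   // mul.hi.u32: upper 32 bits of the product
    return ((unsigned long long)hi << 32) | lo;
}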

Floating-Point. Floating-point arithmetic instructions in CUDA GPUs comply with the IEEE 754-2008 Standard for Floating-Point Arithmetic [28]. Among the five basic formats which the standard defines, the 32-bit binary (i.e., SPF) and 64-bit binary (i.e., DPF) formats are supported in NVIDIA GPUs. As demonstrated in Table 1, the real value assumed by a given normalized SPF or DPF datum with a sign bit $s$, a biased exponent $e$, and a significand of precision $p$ with trailing field $f$ is $(-1)^s \times 2^{e-\mathrm{bias}} \times (1 + f \cdot 2^{1-p})$. Therefore, each SPF or DPF can, respectively, represent a 24-bit or 53-bit integer exactly.

Add and multiply(-add) instructions for floating-point are natively supported. In particular, the floating-point multiply-add operation is always implemented by the fused multiply-add (fma) instruction, which is executed as one instruction with a single rounding.

Unlike the integer instructions, the floating-point add and multiply-add instructions do not support a carry flag (CF) bit. When the result of an add instruction is beyond the limit of significand bits (24 for SPF, 53 for DPF), a round-off operation happens, in which the least significant bits are left out to keep the significand within the limit. This round-off causes precision loss, which is intolerable in cryptographic calculation. Thus, in the algorithm design, all operations should be carefully scheduled to avoid it.
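The hazard is easy to reproduce on the host side; in the following plain C sketch the first addition silently loses its operand because the exact result would need a 54-bit significand:

#include <stdio.h>

int main(void)
{
    double x = 9007199254740992.0;      /* 2^53, the DPF integer limit */
    printf("%.1f\n", x + 1.0);          /* round-off: prints 2^53, the +1 is lost */
    printf("%.1f\n", (x - 1.0) + 1.0);  /* exact: every integer below 2^53 is safe */
    return 0;
}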

Table 2 compares the computing power of integer, SPF, and DPF arithmetic on the target platform GTX TITAN. It is found that integer is more advantageous than floating-point in the add operation, but the multiply(-add) operation of DPF achieves 2.6 times the performance of integer.

3.3. RSA and Montgomery Multiplication

RSA [3] is an algorithm widely used for digital signature and asymmetric encryption, whose core operation is modular exponentiation. In practical scenarios, CRT [29] is widely used to speed up RSA decryption. Instead of calculating an $n$-bit modular exponentiation directly, two $n/2$-bit modular exponentiations ((1a) and (1b)) and the Mixed-Radix Conversion (MRC) algorithm [30] (2) are sequentially performed to conduct the RSA decryption:

$m_p = c^{d_p} \bmod p$, (1a)
$m_q = c^{d_q} \bmod q$, (1b)
$m = m_q + ((m_p - m_q) \, q_{inv} \bmod p) \cdot q$, (2)

where $p$ and $q$ are $n/2$-bit prime numbers chosen in private key generation ($N = pq$). All parameters, $p$, $q$, $d_p$, $d_q$, and $q_{inv}$ ($= q^{-1} \bmod p$), are parts of the RSA private key [31]. Compared with calculating the $n$-bit modular exponentiation directly, the CRT technology gives about 3 times the performance [29].
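For illustration, the MRC recombination (2) can be sketched in plain C for toy word-sized parameters (p and q below are assumed to fit in 32 bits so that no intermediate product overflows 64 bits; a real implementation works on multiprecision numbers):

#include <stdint.h>

/* m = m_q + ((m_p - m_q) * q_inv mod p) * q, with m_p < p, m_q < q */
uint64_t crt_combine(uint64_t mp, uint64_t mq,
                     uint64_t p, uint64_t q, uint64_t qinv)
{
    uint64_t h = (mp + p - mq % p) % p;   /* (m_p - m_q) mod p, kept non-negative */
    h = h * qinv % p;                     /* multiply by q^{-1} mod p             */
    return mq + h * q;                    /* final result, smaller than p*q       */
}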

Even with the CRT technology, the bottleneck restricting the overall performance of RSA lies in modular multiplication. In 1985, Montgomery proposed an algorithm [32] to remove the costly division operation from the modular reduction. Let $\bar{a} = aR \bmod M$ and $\bar{b} = bR \bmod M$ be the Montgomeritized forms of $a$ and $b$ modulo $M$, where $R$ and $M$ are coprime and $R > M$. Montgomery multiplication defines the multiplication between 2 Montgomeritized-form numbers, $\mathrm{MonMul}(\bar{a}, \bar{b}) = \bar{a} \bar{b} R^{-1} \bmod M = a b R \bmod M$. Even though the algorithm works for any $R$ which is relatively prime to $M$, it is more useful when $R$ is taken to be a power of 2, which leads to a fast division by $R$.
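The idea is captured by the following word-sized C sketch with $R = 2^{32}$ and an odd modulus below $2^{31}$ (so the 64-bit intermediate cannot overflow); the constant Mprime $= -M^{-1} \bmod 2^{32}$ is precomputed:

#include <stdint.h>

/* returns a * b * 2^(-32) mod M, for odd M < 2^31 and a, b < M */
uint32_t monmul32(uint32_t a, uint32_t b, uint32_t M, uint32_t Mprime)
{
    uint64_t t = (uint64_t)a * b;              /* full 64-bit product           */
    uint32_t q = (uint32_t)t * Mprime;         /* quotient digit: t*M' mod 2^32 */
    uint64_t s = (t + (uint64_t)q * M) >> 32;  /* t + q*M is divisible by 2^32  */
    return s >= M ? (uint32_t)(s - M) : (uint32_t)s;
}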

A series of improvements to the Montgomery multiplication have been published since its introduction. In 1995, Orup [33] economized the determination of quotients by loosening the restriction on the input and output from $[0, M)$ to $[0, 2M)$. Algorithm 1 shows the detailed steps.

Algorithm 1: Montgomery multiplication with loosened input/output restrictions [33].

Input:
Modulus $M$ with $\gcd(M, 2) = 1$; positive integers $w$, $d$ such that $4M < 2^{wd}$;
$M' = -M^{-1} \bmod 2^w$;
Integer multiplicand $A = \sum_{i=0}^{d-1} a_i 2^{wi}$, where $0 \le a_i < 2^w$ and $0 \le A < 2M$;
Integer multiplier $B$, where $0 \le B < 2M$;
Output:
An integer $S$ such that $S \equiv A B 2^{-wd} \pmod{M}$ and $0 \le S < 2M$;
(1) $S = 0$
(2) for $i = 0$ to $d-1$ do
(3)  $S = S + a_i B$
(4)  $q_i = ((S \bmod 2^w) M') \bmod 2^w$
(5)  $S = (S + q_i M) / 2^w$
(6) end for

4. DPF-Based Montgomery Multiplication

We aim to implement $n$-bit RSA decryption with CRT as introduced in Section 3.3; thus $n/2$-bit Montgomery multiplication shall be implemented first. This section proposes a DPF-based Montgomery multiplication parallel calculation scheme for CUDA GPUs, including the large integer representation, the fundamental operations, and the parallelization method of Montgomery multiplication.

4.1. Advantages and Challenges of DPF-Based RSA

The DPF instruction set has a huge advantage for asymmetric cryptographic algorithm acceleration, as demonstrated in Table 2. Since RSA builds on large integer multiplication, such a huge advantage becomes a decisive point in competing with the integer instruction set.

However, many problems are encountered when exploiting its theoretically superior multiplication computing power.
(i) Round-Off Problem. Due to the round-off problem, every detail should be taken into careful consideration, which makes it very difficult and complicated to design and implement the algorithm.
(ii) Nonsupport for Carry Flag. The lack of support for a carry flag makes it very inconvenient and inefficient to perform multiprecision add or multiply-add operations. Instead of using only one carry-flag-supported integer instruction, the carry has to be handled manually via multiple add or multiply-add instructions.
(iii) Inefficient Add Instruction. From Table 2, it can be found that, unlike the multiplication instruction, the DPF add instruction is slower than the integer add instruction. Moreover, the nonsupport for the carry flag makes it perform even worse in multiprecision add operations.
(iv) Inefficient Bitwise Operations. Floating-point arithmetic does not support the bitwise operations which are frequently used. The CUDA Math API does provide the fmod function [34], which can be employed to extract the least and most significant bits, but it consumes tens of instructions, which is extremely inefficient, whereas with the native integer instruction set a bitwise operation needs only one instruction.
(v) Extra Memory and Register File Cost. A DPF occupies 64 bits of memory space; however, only 26 or fewer bits are used in our representation. In this way, nearly 1.5 times extra cost in memory access and utilization of register files has to be paid. In integer-based implementations, this issue does not arise since every bit of an integer is utilized.

Before the introduction of the proposed algorithm, several symbols are defined in Table 3 for better reading of the following sections.

4.2. DPF-Based Representation of Large Integer

In Montgomery multiplication, the multiply-accumulate operation is frequently used. In CUDA, the fused multiply-add (fma) instruction is provided to perform the floating-point multiply-add operation $c = \mathrm{fma}(x, y, c) = x \cdot y + c$.

When each DPF ($x$ and $y$) contains $w$ ($w \le 26$) significant bits, $2^{53-2w}$ executions of $c = \mathrm{fma}(x, y, c)$ (the initial value of $c$ being zero) can be performed free from the round-off problem.

Note that, in Algorithm 1, there are $d$ loops and each loop contains 2 fma operations for each limb, where $d = \lceil n/w \rceil$ and $n$ stands for the bit length of the modulus. Thus $2d$ fma operations are needed in total for each limb. The following requirement should be met for $w$:

$2 \lceil n/w \rceil \le 2^{53-2w}$, (3)

so that the number of the supported fma operations surpasses the required ones. A lower $w$ means more instructions are required to process the whole algorithm. From Formula (3), to achieve the best performance, $w$ is chosen as shown in Table 4.
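The selection can be reproduced with a few lines of host C code; the sketch below scans downward from w = 26 and reports the largest limb width satisfying Formula (3) for the three CRT modulus sizes used in this paper (the printed values are derived from (3) itself; Table 4 is authoritative for the paper's actual choices):

#include <stdio.h>

int main(void)
{
    int sizes[] = { 1024, 1536, 2048 };   /* CRT halves of RSA-2048/3072/4096 */
    for (int i = 0; i < 3; i++) {
        int n = sizes[i];
        for (int w = 26; w >= 1; w--) {
            long long d = (n + w - 1) / w;          /* d = ceil(n/w) limbs */
            if (2 * d <= (1LL << (53 - 2 * w))) {   /* requirement (3)     */
                printf("n = %4d: w = %d, d = %lld\n", n, w, d);
                break;                              /* largest admissible w */
            }
        }
    }
    return 0;
}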

In this contribution, two kinds of DPF-based large integer representations are proposed, the Simplified format and the Redundant format.
(i) Simplified format is like $(x_{d-1}, \ldots, x_1, x_0)$, where each limb $x_i$ contains at most $w$ significant bits. It is applied to represent the inputs of the fma instruction.
(ii) Redundant format is like $(x_{d-1}, \ldots, x_1, x_0)$, where each limb $x_i$ contains at most 53 significant bits. It is applied to accommodate the outputs of the fma instruction.

4.3. Fundamental Operation and Corresponding Optimization

In the DPF-based Montgomery multiplication, the fundamental operations include multiplication, multiply-add, addition, and bit extraction.
(i) Multiplication. In the implementation, the native multiplication instruction (mul) is used to perform multiplication. It is required that both the multiplicand and the multiplier are in Simplified format to avoid the round-off problem.
(ii) Multiply-Add. In CUDA GPUs, the fma instruction is provided to perform the floating-point operation $x \cdot y + c$, which is executed in one step with a single rounding. In the implementation, when using the fma instruction, it is required that the multiplicand and multiplier are both in Simplified format and the addend is in Redundant format but less than $2^{53} - 2^{2w}$ (so that the result cannot exceed 53 bits).
(iii) Bit Extraction. The algorithm needs to extract the most or least significant bits from a DPF. However, as introduced in Section 4.1, bitwise operations on DPFs are inefficient. Two attempted improvements are made to promote the performance (see the sketch after this list). The first one is introduced in [35]: using round-to-zero arithmetic, we can perform $h = \mathrm{mul.rz}(x, 2^{-k})$ to extract the most significant bits and $l = \mathrm{fma}(h, -2^k, x)$ to extract the least significant $k$ bits from a DPF $x$. Note that, in the round-to-zero multiplication, the least significant $k$ bits are left out due to the round-off operation. The second one is converting the DPF to an integer and then using the CUDA native 32-bit integer bitwise instructions to handle the bit extraction. Through the experiments, it is found that the second method always gives better performance. Therefore, in most cases, the second method is used to handle bit extraction. There is only one exception: when the DPF is divisible by $2^k$, a DPF division by $2^k$ is used to extract the most significant bits, which can be executed very fast.
(iv) Addition. CUDA GPUs provide the native add instruction to perform addition between two DPFs. But, as aforementioned, it is inefficient and provides no support for the carry flag. Thus, the DPFs are first converted into integers; then the CUDA native integer add instructions are used to handle the addition.
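The following device-side sketch contrasts the two bit-extraction variants from item (iii) for a nonnegative Redundant-format DPF x < 2^53 split at bit k; the function names are ours:

__device__ void split_fp(double x, int k, double *hi, double *lo)
{
    /* Variant 1: pure floating-point, per [35]. The round-to-zero multiply
       truncates the low k bits; the fma then recovers them exactly. */
    *hi = __dmul_rz(x, ldexp(1.0, -k));   /* floor(x / 2^k)      */
    *lo = fma(*hi, -ldexp(1.0, k), x);    /* x - hi * 2^k, exact */
}

__device__ void split_int(double x, int k, double *hi, double *lo)
{
    /* Variant 2 (usually faster per the text): convert to a 64-bit
       integer and use native integer bitwise instructions. */
    long long v = __double2ll_rz(x);
    *hi = (double)(v >> k);
    *lo = (double)(v & ((1LL << k) - 1));
}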

4.4. DPF-Based Montgomery Multiplication Algorithm

With reference to Algorithm 1, in the Montgomery multiplication, $A$, $B$, $M$, and $S$ are all $n$-bit integers (in fact, $S$ can reach $n+1$ bits, but it is also represented as an $n$-bit-style integer for a common format). Choosing $w$ as the limb size, $d = \lceil n/w \rceil$ DPF limbs are required to represent each of $A$, $B$, $M$, and $S$.

In previous works [19, 21], the Montgomery multiplication is parallelized by single limbs; that is, each thread deals with one limb (a 32-bit or 64-bit integer). The one-limb parallelism causes a large cost in thread synchronization and communication, which greatly decreases the overall throughput. To maximize the throughput, Neves and Araujo [22] performed one entire Montgomery multiplication in one thread to economize the overhead of thread synchronization and communication, however, resulting in a high latency, about 150 ms for 1024-bit RSA, which is about 40 times that of [19].

To make a tradeoff between throughput and latency, in the implementation, we employ multiple-limb parallelism, namely, using $t$ threads to compute one Montgomery multiplication with each thread dealing with $k$ limbs, where $tk = d$. The degree of parallelism can be flexibly configured to offer the maximal throughput with acceptable latency. Additionally, we restrict the $t$ threads of one Montgomery multiplication within a warp, in which threads are naturally synchronized free from the overhead of thread synchronization and the shuffle instruction can be used to share data between threads. To fully occupy the thread resource, $t$ shall be a divisor of the warp size (i.e., 32).

In the proposed Montgomery multiplication $\mathrm{MonMul}(A, B) = A B 2^{-wd} \bmod M$, the inputs $A$, $B$, and $M$ are in Simplified format and the initial value of $S$ is 0. Two phases are employed to handle one Montgomery multiplication, the Computing Phase and the Converting Phase. In the Computing Phase, Algorithm 2 is responsible for calculating $S = A B 2^{-wd} \bmod M$, whose result is represented in Redundant format. Then, in the Converting Phase, $S$ is converted from Redundant format to Simplified format by applying Algorithm 3.

Algorithm 2: Computing Phase of the parallelized Montgomery multiplication (per Thread $j$).

Input:
$j$: Thread ID, where $0 \le j \le t-1$;
$d$: loop count, where $d = \lceil n/w \rceil$ and each thread holds $k = d/t$ limbs;
$a_0, \ldots, a_{k-1}$: multiplicand limbs, where $a_l = A_{jk+l}$, in Simplified format;
$b_0, \ldots, b_{k-1}$: multiplier limbs, where $b_l = B_{jk+l}$, in Simplified format;
$m_0, \ldots, m_{k-1}$: modulus limbs, where $m_l = M_{jk+l}$, in Simplified format;
$M'$: integer, where $M' = -M^{-1} \bmod 2^w$;
Output:
Redundant-format sub-result
$s_0, \ldots, s_{k-1}$, where $s_l = S_{jk+l}$;
(1) $s_0, \ldots, s_{k-1} = 0$
(2) for $i = 0$ to $d-1$ do
(3)  $a = \mathrm{shuffle}(a_{i \bmod k}, \lfloor i/k \rfloor)$
Step (1): $S = S + a_i B$
(4)  for $l = 0$ to $k-1$ do
(5)   $s_l = \mathrm{fma}(a, b_l, s_l)$
(6)  end for
Step (2): $q_i = ((S \bmod 2^w) M') \bmod 2^w$
(7)  if $j = 0$ then
(8)   $temp = s_0 \bmod 2^w$
(9)   $q = temp \cdot M'$
(10)   $q = q \bmod 2^w$
(11)  end if
(12)  $q = \mathrm{shuffle}(q, 0)$
Step (3): $S = S + q_i M$
(13)  for $l = 0$ to $k-1$ do
(14)   $s_l = \mathrm{fma}(q, m_l, s_l)$
(15)  end for
Step (4): $S = S / 2^w$
(16)  $s_k = (j = t-1)\,?\;0 : \mathrm{shuffle}(s_0, j+1)$
(17)  $h = (j = 0)\,?\;(s_0 / 2^w) : 0$
(18)  for $l = 0$ to $k-1$ do
(19)   $s_l = s_{l+1}$
(20)  end for
(21)  $s_0 = s_0 + h$
(22) end for
Algorithm 3: Converting Phase (per Thread $j$).

Input:
$j$: Thread ID;
$k$: number of processed limbs per thread;
$t$: number of threads per Montgomery multiplication, where $tk = d$;
$s_0, \ldots, s_{k-1}$: Redundant-format sub-result, where
$s_l = S_{jk+l}$;
Output:
$s_0, \ldots, s_{k-1}$: Simplified-format sub-result, where
$s_l = S_{jk+l}$;
(1) $carry = 0$
(2) for $l = 0$ to $k-1$ do
(3)  $(c_h, c_l) = s_l + carry$
(4)  $s_l = c_l$, $carry = c_h$
(5) end for
(6) while carry of any thread is non-zero do
(7)  $carry = (j = 0)\,?\;0 : \mathrm{shuffle}(carry, j-1)$
(8)  for $l = 0$ to $k-1$ do
(9)   $(c_h, c_l) = s_l + carry$
(10)   $s_l = c_l$, $carry = c_h$
(11)  end for
(12) end while
4.4.1. Computing Phase

Algorithm 2 is a $t$-thread parallelized version of Algorithm 1. In the algorithm, the inputs $A$, $B$, and $M$ and the output $S$ are all divided equally into $t$ groups and each group is assigned to one single thread. In each Thread $j$, the variables with indexes $jk$ to $jk+k-1$ are processed. Note that if the index of a certain variable is larger than $d-1$, the variable shall be padded with zero. In this section, the lowercase variables (index varying from 0 to $k-1$) indicate the private registers stored in the thread and the uppercase ones (index varying from 0 to $d-1$) represent the global variables. For example, $s_l$ in Thread $j$ represents $S_{jk+l}$. And $x = \mathrm{shuffle}(y, j')$ means that the current thread obtains the variable $y$ of Thread $j'$ and stores it into its variable $x$ using the shuffle instruction.

With reference to Algorithm 1, we introduce the proposed parallel Montgomery multiplication algorithm step by step.

(1) $S = S + a_i B$: in this step, each Thread $j$ calculates $s_l = s_l + a_i b_l$. Note that each thread stores a group of limbs of $A$. Thus, for each Loop $i$, the corresponding $a_i$ shall first be broadcast from a certain thread to all threads. In the view of a single thread, $A_i$ corresponds to $a_{i \bmod k}$ of Thread $\lfloor i/k \rfloor$. The shuffle instruction is used to conduct this broadcast operation, as the sketch below illustrates. Then each thread executes $s_l = \mathrm{fma}(a, b_l, s_l)$, where $0 \le l \le k-1$.
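On Kepler the shuffle instruction moves 32-bit registers, so broadcasting a DPF limb takes two shuffles; the following sketch shows the classic idiom (a double is split into its high and low words and reassembled), with the broadcast of $A_i$ then written as a = shfl_double(a_local[i % k], i / k):

__device__ double shfl_double(double x, int srcLane)
{
    int hi = __double2hiint(x);   // upper 32 bits of the DPF
    int lo = __double2loint(x);   // lower 32 bits of the DPF
    hi = __shfl(hi, srcLane);     // two 32-bit shuffles move the
    lo = __shfl(lo, srcLane);     // whole 64-bit register
    return __hiloint2double(hi, lo);
}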

(2) $q_i = ((S \bmod 2^w) M') \bmod 2^w$: $q_i$ is only related to $S \bmod 2^w$, that is, to $s_0$, which is stored in Thread 0. Therefore, in this step, this calculation is only conducted in Thread 0, while the other threads are idle. Note that $s_0$ is in Redundant format, and the least significant $w$ bits $temp$ of $s_0$ shall first be extracted before executing $q = temp \cdot M'$. And in the next step, $q_i$ will act as a multiplier; hence, the same bit extraction shall also be applied to $q$ to keep only its least significant $w$ bits.

(3) $S = S + q_i M$: in this step, each Thread $j$ calculates $s_l = s_l + q_i m_l$. Because $q_i$ is stored only in Thread 0, similar to Step (1), it should also be broadcast to all threads. Then each thread executes $s_l = \mathrm{fma}(q, m_l, s_l)$, where $0 \le l \le k-1$.

(4) $S = S/2^w$: in this step, each thread conducts the division by $2^w$ by a shift-right operation. In the view of the whole algorithm, shifting right can be done by executing $S_i = S_{i+1}$ ($0 \le i \le d-2$) and padding $S_{d-1}$ with zero. In the view of a single thread, there are 2 noteworthy points. The first point is that Thread $j$ needs to execute $s_{k-1} = S_{jk+k}$, but $S_{jk+k}$ is stored in Thread $j+1$ (as its $s_0$); thus Thread $j+1$ needs to propagate its stored $s_0$ to Thread $j$. The second point is that $S$ is represented in Redundant format; when executing $S = S/2^w$, the upper bits $h = s_0/2^w$ of the least significant limb need to be preserved. Thus, in Thread 0, $s_0 = s_1 + h$ is calculated instead of $s_0 = s_1$ as in the other threads. $h$ is at most $53-w$ bits long, and, due to the shift-right operation, the limb that $h$ is accumulated into is not the same in each loop. Thus $h$ is accumulated at most once into each limb, which does not cause the round-off problem. Note that $s_0$ is divisible by $2^w$; as introduced in Section 4.3, a simple division can be used to efficiently extract its most significant bits.

After the Computing Phase, $S$ is in Redundant format. Next, we use the Converting Phase to convert $S$ into Simplified format.

4.4.2. Converting Phase

In the Converting Phase, $S$ is converted from Redundant format to Simplified format: every $s_i$ adds the carry produced by $s_{i-1}$ ($s_0$ does not execute this addition), holds the least significant $w$ bits of the sum, and propagates the most significant bits as the new carry to $s_{i+1}$. However, this method is serial, and the calculation of every $s_i$ depends on the carry that $s_{i-1}$ produces, which does not comply with the parallel architecture of the GPU. In practice, a parallelized method is applied to accomplish the Converting Phase, which is shown in Algorithm 3.

Algorithm 3 uses the symbol $(c_h, c_l) = c$ to denote that a 53-bit integer $c$ is divided into its most significant $53-w$ bits $c_h$ and its least significant $w$ bits $c_l$. Firstly, every thread executes a chain addition over its $k$ limbs and stores the final carry. Then every Thread $j$ (except Thread $t-1$) propagates the stored carry to Thread $j+1$ using the shuffle instruction and then repeats the chain addition with the propagated carry. This step continues until the carry of every thread is zero, which can be checked by the CUDA __any() voting instruction [4]. The number of iterations is $t$ in the worst case, but in most cases it takes one or two. Compared with the serial method, over 75% of the execution time is economized in the Converting Phase using the parallelized method.
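A condensed CUDA sketch of this loop is given below, operating on the integer images of the k limbs (Section 4.3 converts DPFs to integers for addition); lane is the thread's index within the t-thread group, and the group is assumed to occupy a t-lane shuffle segment so that __shfl_up() does not cross groups:

__device__ void converting_phase(long long s[], int k, int w, int lane, int t)
{
    const long long mask = (1LL << w) - 1;
    int carry = 0;                        /* fits in 32 bits: carry < 2^(53-w) */
    for (int l = 0; l < k; l++) {         /* first chain addition              */
        long long c = s[l] + carry;
        s[l] = c & mask;                  /* keep the least significant w bits */
        carry = (int)(c >> w);            /* most significant bits carried on  */
    }
    while (__any(carry != 0)) {           /* vote: any thread still carrying?  */
        int in = __shfl_up(carry, 1, t);  /* take the carry of Thread lane-1   */
        carry = (lane == 0) ? 0 : in;     /* Thread 0 receives no carry        */
        for (int l = 0; l < k; l++) {     /* repeat the chain addition         */
            long long c = s[l] + carry;
            s[l] = c & mask;
            carry = (int)(c >> w);
        }
    }
}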

After the Converting Phase, $S$ is in Simplified format. An entire Montgomery multiplication is completed.

5. RSA Implementation

This section introduces the techniques used in the implementation of the Montgomery exponentiation, the CRT computation, and the pre-/postcomputation format conversion.

5.1. Montgomery Exponentiation

As introduced in Section 3.3, RSA with the CRT technology requires two $n/2$-bit Montgomery exponentiations to accomplish $n$-bit RSA decryption. With the binary square-and-multiply method, the expected number of modular multiplications is $1.5\ell$ for an $\ell$-bit modular exponentiation. The number can be reduced with the $2^k$-ary method given by [36], which scans multiple bits, instead of one bit, of the exponent at a time. Jang et al. [19] used the sliding window technology Constant Length Nonzero Windows (CLNW) [37] to reduce the number of Montgomery multiplications further. But it is not suitable for our design, in which an entire warp may contain more than one Montgomery multiplication and each Montgomery multiplication has a different exponent, which leads to different scanning step sizes and execution logic. These differences would cause warp divergence, largely decreasing the overall performance. Moreover, the $2^k$-ary method is resistant to timing attacks, as its calculation time is not affected by the bit pattern of the exponent. For all the above reasons, we choose to employ the $2^k$-ary method rather than the CLNW method.

In the $2^k$-ary method, the window size $k$ shall be chosen properly to reduce both the number of modular multiplications and the number of memory accesses, and the window size $k = 6$ is employed as it offers the least computational cost and memory access. In this contribution, the $2^6$-ary method is implemented and reduces the number of modular multiplications from 1536 to 1259 for a 1024-bit modular exponentiation, achieving a 17.9% improvement. In the $2^6$-ary method, a precomputed table of $2^6 = 64$ entries ($T_0, \ldots, T_{63}$) needs to be stored into memory for each Montgomery exponentiation. The memory space required (about 512 KB) is far more than the size of shared memory (at most 48 KB); thus the tables have to be stored into global memory. Global memory load and store operations consume hundreds of clock cycles. To improve the memory access efficiency, the Simplified-format precomputed tables are converted into 32-bit-integer format using the GPU format-convert instruction and then stored in global memory. This optimization economizes about 50% of the memory access consumption.

In the Montgomery exponentiation, the exponent $E$ is represented in integers. Similar to the processing of shared data in the Montgomery multiplication, each thread stores a piece of $E$ and uses shuffle to broadcast the current exponent window from a certain thread to all threads.

Algorithm 4 shows how to conduct the Montgomery exponentiation, where $\mathrm{MonMul}(x, y)$ means calculating $x y 2^{-wd} \bmod M$ using Algorithms 2 and 3.

Algorithm 4: Montgomery exponentiation with the $2^6$-ary method.

Input:
$A$: Simplified-format base number (Montgomeritized);
$E$: $n$-bit integer exponent;
$M$: Simplified-format modulus;
$R$: radix, $R = 2^{wd} \bmod M$;
Output:
$S$: the Montgomery exponentiation result of $A^E$;
(1) $T_0 = R$
(2) for $i = 1$ to 63 do
(3)  $T_i = \mathrm{MonMul}(T_{i-1}, A)$
(4)  Store $T_i$ in Global Memory
(5) end for
(6) Decompose the $n$-bit $E$ into 6-bit limbs $e_{u-1}, \ldots, e_0$, where
$u = \lceil n/6 \rceil$, then assign them into the threads
equally.
(7) $S = T_{e_{u-1}}$
(8) for $i = u-2$ to 0 do
(9)  Use shuffle to obtain $e_i$
(10)  Repeat $S = \mathrm{MonMul}(S, S)$ for 6 times
(11)  $S = \mathrm{MonMul}(S, T_{e_i})$
(12) end for
(13) return $S$
5.2. CRT Computation

In the first implementation, the GPU only took charge of the Montgomery exponentiation, and the CRT computation (2) was offloaded to the CPU using the GNU multiple precision (GMP) arithmetic library [38]. But we found that the low efficiency of the CPU computing power greatly limits the performance of the entire algorithm: the CRT computation occupied about 15% of the execution time. Thus we made an attempt to integrate the CRT computation into the GPU.

For the CRT computation, a modular subtraction and a multiply-add function are additionally implemented. Both functions are parallelized among the threads which take charge of the Montgomery exponentiation. With this design, the CRT computation occupies only about 1% of the execution time and is independent of the CPU computing capability. The scheme is shown in Figure 1.

5.3. Pre- and Postcomputation Format Conversion

The RSA algorithm is built on large integer arithmetic, whose input and output are both large integers. The Digital Signature Standard (FIPS PUB 186-4) [39] by the National Institute of Standards and Technology (NIST) regulates how to conduct the conversion between bit strings and integers. Unfortunately, our DPF-based large integer representation is not consistent with it. Thus, before and after applying the proposed DPF-based RSA algorithm, the conversion between DPF and 32-bit-integer formats shall be conducted.

The first attempt was to conduct the conversion in the CPU, which is easy to implement, as shown in Figure 2(a): before and after the GPU kernel, its DPF-format input and output are converted from or to bit strings using the bitwise and format-conversion instructions of the CPU. However, this strategy has two significant drawbacks.

Firstly, coded in DPF, the input and output require much more data transfer (cudaMemcpy) between GPU and CPU. Specifically, since each Simplified-format DPF carries only a 23-bit significand piece, $\lceil 2048/23 \rceil = 90$ DPFs are needed to store a 2048-bit integer, which occupies $90 \times 64 = 5760$ bits and causes $(5760 - 2048)/2048 = 181\%$ extra data transfer overhead. In fact, the overhead is quite high. As introduced in Section 3.3, an $n$-bit RSA decryption with the CRT technology requires the ciphertext and the private key parameters as input and $n$ bits as output. Assume 896 2048-bit DPF-based RSA decryptions are calculated at once; at a rate of 6 GB/s for data transfer through PCIe between GPU and CPU, cudaMemcpy takes a non-negligible share of the total duration for RSA-2048 decryption, which is only about 20 ms.

Secondly, converting the data format in the CPU is inefficient. The simplest way is to use serial operations to convert before and after the GPU kernel in the CPU. For example, completing the conversions for 896 RSA-2048 decryptions takes over 1.1 ms using a single core of the CPU (Intel E5-2697 v2) in the experiment. The GPU, by contrast, is adept at accomplishing the conversion with thousands of threads and can finish the same amount of work in about 0.1 ms on GTX TITAN.

With full consideration of the above two drawbacks, we turn to transferring data between GPU and CPU as integers rather than DPFs, and, in the GPU kernel, the threads for each RSA decryption accomplish the data conversion before and after the RSA decryption calculation, as shown in Figure 2(b).

Precomputation Conversion. First, the CPU transfers the 32-bit-integer-format inputs to the global memory of the GPU. According to the DPFs it needs to process in Algorithms 2 and 3, each CUDA thread extracts $w$ bits from the corresponding 32-bit integers, reconstructs them in 32-bit-integer format using bitwise instructions, and finally converts the reconstructed integers into Simplified-format DPFs using the integer-to-DPF instruction, as the input of the proposed algorithm.
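A sketch of the per-limb gather is given below; the word array is little-endian and assumed padded with a zero word at the top, and the helper name is ours:

__device__ double load_limb(const unsigned *words, int p, int w)
{
    int idx = p >> 5, off = p & 31;       /* word index and bit offset of bit p */
    unsigned long long win =              /* splice two adjacent 32-bit words   */
        ((unsigned long long)words[idx + 1] << 32) | words[idx];
    unsigned long long piece = (win >> off) & ((1ULL << w) - 1);
    return __ull2double_rn(piece);        /* exact: piece < 2^w <= 2^26         */
}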

Postcomputation Conversion. The postcomputation conversion is the reverse procedure of the precomputation conversion, converting the Simplified-format outputs into 32-bit-integer format and then returning them to the CPU.

This method largely reduces the cost of data transfer and leverages the computing power of GPUs. For 2048-bit RSA decryption, the overall latency decreases by up to 1.2 ms (5%).

6. Performance Evaluation

This section presents the implementation performance and summarizes the results for the proposed algorithm. A relative assessment is also presented by considering related implementations. The hardware and software configurations used in the experiment are listed in Table 5.

6.1. Experiment Result

Applying the DPF-based Montgomery multiplication algorithm and the RSA implementation described in Sections 4 and 5, respectively, RSA-2048/3072/4096 decryptions are implemented on NVIDIA GTX TITAN.

Several configuration parameters may affect the performance of the kernel, including the following:
(i) Batch Size: the number of RSA decryptions per GPU kernel launch.
(ii) Threads/Block: the number of CUDA threads contained in a CUDA block.
(iii) Threads/RSA: the number of CUDA threads assigned to each RSA decryption.
(iv) Regs/Thread: the maximum number of registers assigned to each CUDA thread; both Regs/Thread and Threads/Block × Regs/Thread should be restricted within the GPU hardware limitations (i.e., 255 32-bit registers per thread and 65,536 per CUDA block in GTX TITAN).
Table 6 demonstrates the performance for RSA-2048/3072/4096 with a set of combinations of all the above configurations.

Note that the configuration Batch Size is varied from a small number up to the hardware limitation. Figure 3 summarizes the impact of Batch Size on the performance, which indicates that a larger Batch Size always leads to better performance. This is easy to understand: because GPUs pipeline the instructions to leverage instruction-level parallelism [4], configuring as large a Batch Size as possible keeps the pipeline busy most of the time and fully utilizes the computing resource in GPUs, thereby yielding better performance. As the throughput is the factor of most concern in this contribution, Table 6 chooses the maximum Batch Size supported by the GPU hardware for assessment (half of the maximum Batch Size is also provided for comparison), although some lower Batch Size configurations bring a lower latency with an almost equivalent throughput.

Another important factor in performance is Threads/RSA. In Table 6, Column Threads/RSA = $2t$ indicates that $t$ threads are used to process a Montgomery multiplication and $2t$ threads to process one RSA decryption (the two CRT exponentiations proceed in parallel). As discussed in Section 4.4, theoretically, parallelizing a single computing task into too many threads results in poor throughput; meanwhile, parallelism with too few threads may lead to high latency. Towards practical usage and also emphasizing a high throughput, in the experiment, on the premise of an acceptable latency, as few Threads/RSA as possible are applied to offer a high throughput.

However, even without the consideration of acceptable latency, at some points, due to the hardware limitation on the register file per thread, parallelism using too few threads may not promote but even largely decrease the overall throughput. An example in the experiment is RSA-4096 with the smallest Threads/RSA, as shown in Table 6. For RSA-2048, the throughput of the smaller Threads/RSA configuration always precedes that of the larger one. However, for RSA-4096 with the smallest Threads/RSA, the growing number of required variables tremendously exceeds the hardware limitation (127 registers per thread at the maximum Batch Size in GTX TITAN); thus many variables have to spill into off-chip local memory, which is hundreds of times slower than registers. In this way, the throughput is much lower than that with more Threads/RSA. In fact, the proposed DPF-based algorithm is more sensitive to the number of available registers since it consumes more register file than a normal integer-based one, as specified in Section 4.1.

To summarize, for the best practice of the proposed DPF-based algorithm, many factors should be comprehensively taken into consideration to suit both the properties of the GPU hardware and the RSA modulus size, especially Threads/RSA and Regs/Thread as the RSA modulus size grows (7680 bits and more).

6.2. Performance Comparison

This section provides two groups of comparisons depending on different implementation mechanisms, floating-point-based and integer-based.

6.2.1. Proposed versus Floating-Point-Based Implementation

Bernstein et al. [5] employed floating-point arithmetic to implement 280-bit Montgomery multiplication. Table 7 shows the performance of the work [5] and ours. Note that the work [5] only implemented the 280-bit modular multiplication; thus its throughput is scaled by $(280/1024)^2$ to estimate the performance of a 1024-bit one, and it is further scaled by the ratio of the floating-point processing power of the corresponding GPUs. The scaled result is shown in the row "1024-bit MulMod (scaled)."

Table 7 demonstrates that the resulting implementation achieves a 13 times speedup compared with the performance of [5]. Part of the performance promotion results from the advantage DPF achieves over SPF, as discussed in Section 3.2. The reason why they did not utilize DPF is that the GTX 295 they used does not support DPF instructions. The second reason for the advantage comes from the processing of the Montgomery multiplication. Bernstein et al. used the Separated Operand Scanning (SOS) Montgomery multiplication method, which is known to be inefficient [20]. And they utilized only 28 threads of a warp (consisting of 32 threads), which wastes 1/8 of the processing power. The third reason is that we use the latest CUDA shuffle instruction to share data between threads, while [5] used shared memory. As Section 3.1 introduced, the shuffle instruction gives better performance than shared memory. The last reason lies in the fact that Bernstein et al. [5] used floating-point arithmetic to process all operations, some of which are more efficient with integer instructions, such as bit extraction. By contrast, we flexibly utilize integer instructions to accomplish these operations.

6.2.2. Proposed versus Integer-Based Implementation

The previous works [7, 19, 21–23] are all integer-based. Thus we scale their CUDA platform performance based on the integer processing power. The parameters in Table 8 originate from [4, 40], but the integer processing power is not given directly. Taking the SM number, the processing power of each SM, and the Shader Clock into consideration, we calculate the integer multiplication instruction throughput, Int Mul. Among them, 8800 GTS and GTX 260 support only 24-bit multiply instructions, while the other platforms support 32-bit multiply instructions. Hence, we adjust their integer multiply processing capability by a correction parameter (the unadjusted data is in parentheses). The Throughput Scaling Factor is defined as the Int Mul ratio between the corresponding CUDA platform and GTX TITAN. The Latency Scaling Factor is also defined, as the Shader Clock ratio.

Table 8 summarizes the resulting performance of each work. We divide each resulting performance by the corresponding Scaling Factor listed in Table 8 to obtain the scaled performance. Note that the RSA key length of Neves and Araujo [22] is 1024 bits, while ours is 2048 bits, so we multiply theirs by an additional factor (1/4 for the performance of modular multiplication, 1/2 for the half-length exponent).

From Table 8, at the modular multiplication level, the proposed floating-point-based implementation outperforms the integer-based ones by a great margin. We achieve nearly 6 times speedup compared with the work [23], and, even on the same CUDA platform, we obtain 221% of the performance of the work [21].

At the RSA implementation level, the proposed implementation also shows a great performance advantage: 291% of the performance of the work [22] for RSA-2048. Note that the RSA-1024 decryption of [22] has a latency of about 150 ms, while our 2048-bit RSA decryption reaches 21.52 ms at its throughput peak. Yang [24] reported that the latency for an RSA-2048 throughput of 5,244 (scaled as 31,782) is 195.27 ms (scaled as 225.60 ms). The throughput of the proposed RSA-2048 implementation is 132% of the performance of Yang's work, while the latency is 9.4% of that of their work [24]. Our peak RSA-2048 throughput is slightly lower than that of the fastest integer-based implementation [7], while our latency is 3 times lower. For RSA-4096 decryption, Emmart and Weems [7] use a distributed model based on the CIOS method to report a throughput of 5,257 RSA-4096 decryptions on a GTX 780 Ti; scaling the performance to a GTX TITAN gives 4,693, and our work is 1.23 times faster.

The performance advantage lies mainly in the utilization of the floating-point processing power and the superior handling of the Montgomery multiplication, which overcomes the problems addressed in Section 4.1. The DPF-based representation of large integers and the Montgomery multiplication are carefully scheduled to avoid the round-off problem and to largely decrease the number of expensive carry-handling processes. For the inefficient add instruction and bitwise operations of DPF, the integer instruction set is flexibly employed to supplement the deficiency. And the precomputed-table storage optimization (in Section 5.1) and the design of the pre- and postcomputation format conversion (in Section 5.3) mitigate the performance influence brought by the extra memory and register file cost.

Besides, compared with the works using multiple threads to process a Montgomery multiplication [19, 21], another vital reason is that we use the more efficient shuffle instruction to handle data sharing instead of shared memory and employ a more reasonable degree of thread parallelism to economize the overhead of thread communication.

6.3. Discussion

At the 2016 GPU Technology Conference, NVIDIA released its latest Tesla P100 accelerator, built on the Pascal GP100 GPU, which brings a huge improvement in floating-point computing power, especially DPF. Tesla P100 can deliver 5.3 TFLOPS of DPF performance, which is 3.4 times that of our target platform [41]. Meanwhile, the number of registers for each CUDA core is doubled [41], which is essential for DPF-based RSA implementation. The performance of the proposed DPF-based algorithm can be improved by at least three times based on all the above improvements, which will further widen the gap between the DPF-based and the traditional integer-based algorithms.

7. Conclusion

In this contribution, we propose a brand new approach to implement high-performance RSA cryptosystems on the latest CUDA GPUs by utilizing their powerful floating-point computing resource. Our results demonstrate that the floating-point computing resource is a more competitive candidate for asymmetric cryptography implementation in CUDA GPUs. On NVIDIA GeForce GTX TITAN, our resulting RSA-2048 decryption reaches a throughput of 42,211 operations per second, which achieves 13 times the performance of the previous floating-point-based implementation. For RSA-3072/4096 decryption, our peak throughputs are 12,151 and 5,790, and the RSA-4096 implementation is 1.23 times faster than the best integer-based result. We also hope our endeavor can shed light on future research and inspire more case studies using the floating-point computing power of GPUs, for RSA as well as other asymmetric cryptography. We will apply these designs to other arithmetic, such as floating-point-based ECC implementation, and exploit the floating-point power of Tesla P100.

Disclosure

A preliminary version of this paper appeared under the title Exploiting the Floating-Point Computing Power of GPUs for RSA, in Proc. Information Security, 17th International Conference, ISC 2014, Hong Kong, October 12–14, 2014 [42] (Best Student Paper Award). Dr. Yuan Zhao participated in this work when he studied at the Chinese Academy of Sciences and now works at Huawei Technologies, Beijing, China.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National 973 Program of China under Award no. 2014CB340603 and the Strategic Priority Research Program of Chinese Academy of Sciences under Grant XDA06010702.