Security and Communication Networks

Volume 2017, Article ID 3508786, 15 pages

https://doi.org/10.1155/2017/3508786

## Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration

^{1}State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China^{2}Data Assurance and Communication Security Research Center, Chinese Academy of Sciences, Beijing, China^{3}School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China

Correspondence should be addressed to Fangyu Zheng; nc.ca.si@gnehzyf

Received 23 May 2017; Accepted 8 August 2017; Published 17 September 2017

Academic Editor: Zhe Liu

Copyright © 2017 Jiankuo Dong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Asymmetric cryptographic algorithm (e.g., RSA and Elliptic Curve Cryptography) implementations on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is exploiting the highly parallel GPU architecture and porting the integer-based algorithms from general-purpose CPUs to GPUs, to offer high performance. However, the great potential cryptographic computing power of GPUs, especially by the more powerful floating-point instructions, has not been comprehensively investigated in fact. In this paper, we fully exploit the floating-point computing power of GPUs, by various designs, including the floating-point-based Montgomery multiplication/exponentiation algorithm and Chinese Remainder Theorem (CRT) implementation in GPU. And for practical usage of the proposed algorithm, a new method is performed to convert the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further promoting the overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which achieves 13 times the performance of the previous fastest floating-point-based implementation (published in Eurocrypt 2009). The RSA-4096 decryption precedes the existing fastest integer-based result by 23%.

#### 1. Introduction

With the rapid development of e-commerce and cloud computing, the high-density calculation of digital signature and asymmetric cryptographic algorithms such as the Elliptic Curve Cryptography (ECC) [1, 2] and RSA [3] algorithms is urgently required. However, without significant development in recent years, CPUs are more and more difficult to keep pace with such rapidly increasing demands. Specialized for the compute-intensive and high-parallel computations required by graphics rendering, GPUs own much more powerful computation capability than CPUs by devoting more transistors to arithmetic processing unit rather than data caching and flow control. With the advent of NVIDIA Compute Unified Device Architecture (CUDA) technology, it is now possible to perform general-purpose computation on GPUs. Due to their powerful arithmetic computing capability and moderate cost, many researchers resort to GPUs to perform cryptographic acceleration.

Born for high-definition 3D graphics, GPUs require high-speed floating-point processing capability; thus the floating-point units residing in GPUs achieve dramatical increase. From 2010 to the present, floating-point computing power of CUDA GPUs grows almost 10 times, from 1,345/665.6 (Fermi architecture) Giga Floating-point Operations Per Second (GFLOPS) to 10,609/5304 (Pacal architecture) GFLOPS for single/double-precision floating-point arithmetic. By contrast, integer multiplication arithmetic of NVIDIA GPUs gains only 25% performance improvement from Fermi to Kepler architecture; the latest Maxwell and Pascal architectures even do not provide dedicated device instructions for 32-bit integer multiply and multiply-add arithmetic [4].

However, the floating-point instruction set is inconvenient to realize large integer modular multiplication which is the core operation of asymmetric cryptography. More importantly, the floating-point instruction set in the previous GPUs shows no significant performance advantage over the integer one. To the best of our knowledge, Bernstein et al. [5] are the first and the only one to utilize the floating-point processing power of CUDA GPUs for asymmetric cryptography. However, compared with their later work [6] based on the integer instruction set, the floating-point-based one only achieves about 1/6 performance. Nevertheless, with the rapid development of GPU floating-point processing power, fully utilizing the floating-point processing resource will bring great benefits to the asymmetric cryptography implementation in GPUs.

Based on the above observations, in this paper, we propose a new approach to implement high-performance RSA by fully exploiting the double-precision floating-point (DPF) processing power in CUDA GPUs. In particular, we propose a DPF-based large integer representation and adapt the Montgomery multiplication algorithm to it. Also, we flexibly employ the integer instruction set to supplement the deficiency of the DPF computing power. Besides, we fully utilize the latest* shuffle* instruction to share data between threads, which makes the core algorithm Montgomery multiplication a non-memory-access design, decreasing greatly the performance loss in the thread communication. Additionally, a method is implemented to apply the proposed DPF-based RSA decryption algorithm in practical scenarios where the input and output shall be in bit-format, improving further the overall performance by about 5%, by decreasing data transfer consumption using smaller data format and leveraging GPUs to promote the efficiency of data conversion.

With these improvements, in GTX TITAN, the performance of the proposed RSA implementation increases dramatically compared with the previous works. In particular, the experimental results of RSA-2048/3072/4096 decryption with CRT reach the throughput of 42,211/12,151/5,790 operations per second and achieve 13 times the performance of the previous floating-point-based implementation [5], and RSA-4096 decryption is 1.23 times the performance of the existing fastest integer-based one [7].

The rest of our paper is organized as follows. Section 2 introduces the related work. Section 3 presents the overview of GPU, CUDA, floating-point elementary knowledge, RSA, and Montgomery multiplication. Section 4 describes our proposed floating-point-based Montgomery multiplication algorithm in detail. Section 5 shows how to implement RSA decryption in GPUs using our proposed Montgomery multiplication. Section 6 analyses performance of proposed algorithm and compares it with previous work. Section 7 concludes the paper.

#### 2. Related Work

Many previous papers demonstrate that the GPU architecture can be used as an asymmetric cryptography workhorse. For ECC, Antão et al. [8, 9] and Pu and Liu [10] employed the Residue Number System (RNS) to parallelize the modular multiplication into several threads. Bernstein et al. [6] and Leboeuf et al. [11] used one thread to handle a modular multiplication with Montgomery multiplication. Pan et al. [12], Zheng et al. [13], Bos [14], and Szerwinski and Güneysu [15] used fast reduction to implement modular multiplication over the Mersenne prime fields [16].

Unlike ECC scheme, RSA calculation requires longer and unfixed modulus and depends on modular exponentiation. Before NVIDIA CUDA appeared on the market, Moss et al. [17] mapped RNS arithmetic to the GPU to implement a 1024-bit modular exponentiation. Later in CUDA GPUs, Szerwinski and Güneysu [15] and Harrison and Waldron [18] developed efficient modular exponentiation by both Montgomery multiplication Coarsely Integrated Operand Scanning (CIOS) method and the RNS arithmetic. Jang et al. [19] presented a high-performance SSL acceleration using CUDA GPUs. They parallelized the Separated Operand Scanning (SOS) method [20] of the Montgomery multiplication by single limb. Jeffrey and Robinson [21] used the similar technology to implement 256-bit, 1024-bit, and 2048-bit Montgomery multiplication in GTX TITAN. Neves and Araujo [22] and Henry and Goldberg [23] used one thread to perform single Montgomery multiplication to economize the overhead of thread synchronization and communication; however, their implementations resulted in a very high latency. Emmart and Weems [7] applied one thread to a row oriented multiplication for 1024-bit modular exponentiation and a distributed model based on CIOS method for 2048-bit modular exponentiation. Yang [24] used an Integrated Operand Scanning (IOS) method with single limb or two limbs for Montgomery multiplication algorithm to implement RSA-1024/2048 decryption.

Note that all above works are based on the CUDA integer computing power. Bernstein et al. are the pioneers to utilize CUDA floating-pointing processing power in asymmetric cryptography implementations [5]. They used 28 single precision floating-points (SPFs) to represent 280-bit integer and implemented the field arithmetic. However, the result was barely satisfactory. Their later work [6] in the same platform GTX 295 using the integer instruction set performed almost 6.5 times the throughput of [5].

#### 3. Background

In this section, a brief introduction to the basic architecture of modern GPUs, the floating-point arithmetic, the basic knowledge of RSA, and the Montgomery multiplication are provided.

##### 3.1. GPU and CUDA

CUDA is a parallel computing platform and programming model that enables dramatical increase in computing performance by harnessing the power of GPU. It is created by NVIDIA and implemented by the GPUs which they produce [4].

The target platform, GTX TITAN, is a CUDA-compatible GPU (codename GK-110) with the CUDA Compute Capability 3.5 [25], which contains 14 streaming multiprocessors (SMs). 32 threads (grouped as a* warp*) within one SM concurrently run in a clock. Following the Single Instruction Multiple Threads (SIMT) architecture, each GPU thread runs one instance of the kernel function. A warp may be preempted when it is stalled due to memory access delay, and the scheduler may switch the runtime context to another available warp. Multiple warps of threads are usually assigned to one SM for better utilization of the pipeline of each SM. These warps are called one* block*. Each SM has 64 KB on-chip memory which can be configured as fast* shared memory* and L1 cache; the maximum allocation of shared memory is 48 KB [25]. Each SM also possesses 64 K 32-bit registers. All SMs share 6 GB 256-bit wide slow* global memory*, cached read-only* texture memory,* and cached read-only* constant memory* [4, 25]. Each SM of GK-110 owns 192 single precision CUDA cores, 64 double-precision units, 32 special function units (SFUs), and 32 load/store units [25], yielding a throughput of 192 SPF arithmetic, 64 DPF arithmetic, 160 32-bit integer add, and 32 32-bit integer multiplication instructions per clock circle [4, 25].

NVIDIA GPUs of Compute Capability 3.x or later bring a new method of data sharing between threads. Previously, shared data between threads requires separated store and load operations to pass data through* shared memory*. First introduced in the NVIDIA Kepler architecture,* shuffle* instruction [4, 25] allows threads within a warp to share data. With shuffle instruction, threads within a warp can read any value of other threads in any imaginable permutations [25]. NVIDIA conducted various experiments [26] on the comparison between shuffle instruction and shared memory, which show that shuffle instruction always gives a better performance than shared memory. This instruction is available in GTX TITAN.

##### 3.2. Integer and Floating-Point Arithmetic in GPU

NVIDIA GPUs with CUDA Compute Capability 3.x support both integer and floating-point arithmetic instruction sets. This section introduces some most concerned instructions in asymmetric cryptographic algorithm implementations, including add and multiply(-add) operations.

*Integer*. Integer arithmetic instructions add.cc, addc, mad.cc, and madc are provided to perform multiprecision add and multiply-add operations, which reference an implicitly specified condition code register (CC) having a single carry flag bit (CC.CF) holding carry-in/carry-out [27]. With this native support for multiprecision arithmetic, most of the previous works [6, 8–11, 13–15, 18, 19, 21–23] chose to use integer instructions to implement large integer arithmetic in asymmetric cryptographic algorithms.

One noteworthy point is that multiply or multiply-add instructions of NVIDIA GPUs with Compute Capability 3.x have a unique feature: when calculating the 32-bit multiplication, the whole 64-bit product cannot be obtained using single instruction but requires two independent instructions (one is for lower-32-bit half and the other for upper-32-bit half). Although the whole multiplication “instruction” (mul.wide) is provided which is used in [22, 23], it is a* virtual* instruction but not the native instruction, which is broken into 2 native instructions (mul.lo. and mul.hi) when running on the GPUs.

*Floating-Point*. Floating-point arithmetic instructions in CUDA GPUs comply with 754-2008 IEEE Standard for Floating-Point Arithmetic [28]. Among five basic formats which the standard defines, 32-bit binary (i.e., SPF) and 64-bit binary (i.e., DPF) are supported in NVIDIA GPUs. As demonstrated in Table 1, the real value assumed by a given SPF or DPF data with a sign bit , a given biased exponent , and a significand precision is . Therefore, each SPF or DPF can, respectively, represent 24-bit or 53-bit integer.