Abstract

Teleoperated robotic systems are those in which human operators control remote robots through a communication network. The deployment and integration of teleoperated robotic systems in medical operations have been hampered by many issues, such as safety concerns. Elliptic curve cryptography (ECC), an asymmetric cryptographic algorithm, is widely used in practice because it offers the same level of security as RSA with a significantly shorter key. The efficiency of ECC over GF(p) is dictated by two critical factors, namely, modular multiplication (MM) and point multiplication (PM) scheduling. In this paper, a high-performance ECC architecture for SM2 is presented. MM is composed of multiplication and modular reduction (MR) in the prime field. A two-stage modular reduction (TSMR) algorithm in the SCA-256 prime field is introduced to achieve low latency; it avoids most of the iterative subtractions required by traditional algorithms. To cut down the run time, a schedule is put forward that exploits the parallelism of multiplication and MR inside PM. Synthesized with a 0.13 μm CMOS standard cell library, the proposed processor occupies 341.98k gates, and each PM takes 0.092 ms.

1. Introduction

In teleoperated robotic systems, human operators, often geographically distant, interact with and control robots through a communication network. Teleoperated robotic systems have many applications, such as bomb disposal, search and rescue, and robotic surgery. Teleoperated robotic surgery is a particularly important medical application: expert surgery can be performed remotely, without the surgeon being physically present. It is expected to have a significant impact on the quality of medical services in isolated regions, battlefields, and disaster areas. As teleoperated systems and robots have developed, their deployment and integration in medical operations have encountered many problems, such as safety concerns [1], time delay [2], and bilateral control [3]. Security is one of the biggest issues hampering the deployment and integration of teleoperated robots, and some work has addressed it [4].

Telerobotic surgery is expected to be employed in extreme conditions, where teleoperated robots may have to operate in harsh and low-power conditions, connecting to the Internet with potential loss. As depicted in Figure 1, the last communication link may even be a wireless link to a drone or a satellite, providing the connection to a trusted facility (possibly a large hospital with an established infrastructure) [5].

In such operating conditions, the security of the long-range control link is critical: if a teleoperated robot is attacked, the resulting loss of proper control may cause damage. Besides, it is necessary to verify that these security requirements are established and maintained throughout a teleoperated procedure [6].

In harsh conditions, low power consumption and short time delay are essential. Hence, security processes such as digital signature/verification and encryption/decryption should be accelerated in hardware. Compared with software implementation, hardware implementation has many advantages, such as high efficiency, low power consumption, and safety. ECC is a public key cryptography algorithm that can provide these security processes; it was proposed in 1986 by Miller [7] and Koblitz [8]. It has been demonstrated to be an alternative to classical RSA [9] thanks to its significantly reduced key lengths [10]. ECC with 160–256-bit keys provides security similar to that of RSA or discrete logarithm schemes over finite fields with 1024–4096-bit keys [11]. SM2, an ECC algorithm, was included in ISO/IEC 14888-3/AMD1 in November 2017.

Considerable efforts have been made to implement ECC in hardware, as can be seen in [12–22], where the MM operation is used heavily within PM. The designs proposed to accelerate MM fall into three categories [23]: (1) the recommended prime modular multiplication algorithm, (2) the Montgomery multiplication algorithm, and (3) the interleaved modular multiplication algorithm. Among the three, the first is the fastest, but it is limited to specific prime fields, such as NIST and SCA-256. The architecture in [12] uses Montgomery multipliers ranging from 8-bit × 8-bit to 64-bit × 64-bit, improving area efficiency at the cost of speed. The designs in [9, 20] are based on the recommended prime modular multiplication algorithm. However, their MR algorithms contain only one stage, which generates an intermediate result Z that lies well outside the target range (for example, Z ∈ [0, 14p) in [9]). An extra calculation is then required to obtain the final result Z ∈ [0, p). Notably, the architecture in [9] adopts a full-word 256-bit × 256-bit multiplier, and all calculations are executed in the SCA-256 prime field. In the MR operation of [9], up to 13 subtractions are needed in the worst case to bring the intermediate value to the final value, which results in large latency.

Traditional software implementations of cryptographic algorithms incur larger time delay and power consumption, whereas hardware implementation can resolve these issues. Motivated to provide highly efficient safety assurance for teleoperated systems, we realize ECC in hardware. The main contributions of this paper include the following:

(1) We propose a high-performance hardware processor that adopts a half-word multiplier to improve performance while reducing hardware consumption. Compared with most other works, it offers a better trade-off between performance and hardware overhead.

(2) The TSMR algorithm in SCA-256 is proposed to achieve low latency. The algorithm obtains an intermediate result Z ∈ [0, 2p), which requires at most one subtraction to reach the final result Z ∈ [0, p). Compared with the traditional method [9], which obtains an intermediate result Z ∈ [0, 14p), our method avoids many subtractions.

(3) The TSMR algorithm is implemented with a carry-save adder architecture to reduce latency and hardware overhead. Combined with the Karatsuba-Ofman (KO) multiplication algorithm and a pipelined design, MM requires an average of five clock cycles, even though one clock cycle for modular reduction and five clock cycles for multiplication are required.

The arrangement of this paper is as follows. In Section 2, the elliptic curve and PM are introduced. In Section 3, high-performance architecture is illustrated. Then, the proposed method is implemented and validated in Section 4. Finally, in Section 5, the conclusion of this work is provided.

2. Mathematical Background

2.1. Elliptic Curve

A nonsupersingular elliptic curve (EC) over GF(p) is defined as the set of points (x, y) that satisfy equation (1), also known as the Weierstrass equation, together with a point at infinity:

$$y^2 = x^3 + ax + b \pmod{p}, \tag{1}$$

where $a, b \in GF(p)$ are the parameters that identify the EC and satisfy $4a^3 + 27b^2 \not\equiv 0 \pmod{p}$.

2.2. Point Multiplication

PM describes the operation in which k identical EC points are added up to one point, denoted as a scalar times an EC point, "$Q = kP$," where $k = \sum_{i=0}^{l-1} k_i 2^i$ with $k_i \in \{0, 1\}$, and $l$ represents the binary length of $k$. In this work, the width-$w$ NAF addition-subtraction method [24], given in Algorithm 1, is applied to point multiplication.

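Algorithm 1: Width-w NAF addition-subtraction method for point multiplication.
Input: width w, scalar k, EC point P
Output: EC point Q = kP
(1) Precomputation: P_i = iP for i ∈ {1, 3, 5, …, 2^(w−1) − 1}
(2) Compute NAF_w(k) = Σ k_i 2^i, i = 0, …, l − 1
(3) Q ← ∞
(4) for i from l − 1 downto 0 do
   Q ← 2Q
   if k_i ≠ 0 then
    if k_i > 0 then Q ← Q + P_(k_i)
    else Q ← Q − P_(−k_i)
(5) Return Q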
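
To make the recoding and the main loop of Algorithm 1 concrete, the following is a minimal Python sketch. It is not the paper's hardware datapath: the helper names (`naf_w`, `point_mul`, `toy_mul`) are hypothetical, and the group is deliberately replaced by integers under addition modulo an arbitrary number `N`, so that the expected result of "kP" is simply `k * g mod N` and can be checked without any curve arithmetic.

```python
def naf_w(k, w):
    """Width-w NAF digits of k (least significant first).

    Every nonzero digit is odd and lies in (-2^(w-1), 2^(w-1)).
    """
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)              # k mod 2^w
            if d >= (1 << (w - 1)):       # map to the signed residue
                d -= (1 << w)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def point_mul(k, P, w, add, neg, dbl, identity):
    """Left-to-right width-w NAF scalar multiplication over an abstract group."""
    # Precompute the odd multiples P_i = iP, i in {1, 3, ..., 2^(w-1) - 1}.
    table = {1: P}
    two_p = dbl(P)
    for i in range(3, 1 << (w - 1), 2):
        table[i] = add(table[i - 2], two_p)

    Q = identity
    for d in reversed(naf_w(k, w)):
        Q = dbl(Q)                        # one doubling per digit
        if d > 0:
            Q = add(Q, table[d])          # addition for positive digits
        elif d < 0:
            Q = add(Q, neg(table[-d]))    # subtraction for negative digits
    return Q


# Stand-in group (assumption for illustration): integers mod N under addition.
N = 1000003
G = 5


def toy_mul(k, w=4):
    return point_mul(
        k, G, w,
        add=lambda x, y: (x + y) % N,
        neg=lambda x: (-x) % N,
        dbl=lambda x: (2 * x) % N,
        identity=0,
    )
```

With w = 4 the table holds 1P, 3P, 5P, and 7P, which matches the precomputed points used later in the paper. Swapping the three lambdas for real ECADD/ECDBL routines would turn the same loop into an EC point multiplication.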

The PM operation is the elemental operation of ECC and is performed as a sequence of elliptic curve additions (ECADD) and elliptic curve doublings (ECDBL). Let $P_1$ and $P_2$ be EC points; ECADD is defined as $P_3 = P_1 + P_2$ and ECDBL is defined as $P_3 = 2P_1$. To avoid the time-consuming modular inversion/division operation, ECADD is most efficient in mixed affine-Jacobian coordinates, while ECDBL is most efficient in Jacobian coordinates [25].

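ECADD in mixed affine-Jacobian coordinates, for $P_1 = (X_1, Y_1, Z_1)$ and $P_2 = (x_2, y_2)$, and ECDBL in Jacobian coordinates are given in the two following equations, written here in the form implied by the operation sequences of Algorithm 5 (the doubling formula uses $a = -3$):

$$H = x_2 Z_1^2 - X_1,\quad R = y_2 Z_1^3 - Y_1,\quad X_3 = R^2 - H^3 - 2X_1H^2,\quad Y_3 = R\,(X_1H^2 - X_3) - Y_1H^3,\quad Z_3 = Z_1H$$

$$M = 3(X_1 - Z_1^2)(X_1 + Z_1^2),\quad X_3 = M^2 - 8X_1Y_1^2,\quad Y_3 = M\,(4X_1Y_1^2 - X_3) - 8Y_1^4,\quad Z_3 = 2Y_1Z_1$$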

3. High-Performance Architecture of SM2

The PM architecture based on full-word multipliers is described below. TSMR and full-word multiplication constitute MM, while the binary modular inversion algorithm in [26] is applied to execute the modular inversion (MI) operation.

3.1. Modular Reduction

SCA-256 has the characteristic that its prime can be written as $p = 2^{256} - 2^{224} - 2^{96} + 2^{64} - 1$. The traditional MR for SCA-256 [9] is given in Algorithm 2. After the fast reduction operation, the intermediate value Z is the signed sum of the terms s1 to s14 listed in Algorithm 2, where Z ∈ [0, 14p). It costs at most 13 subtractions to obtain the final result Z ∈ [0, p). Since the modular reduction is computed in a single clock cycle, the repeated subtractions have a significant influence on latency and consume a lot of hardware resources.

Input: Integer c = (c15, c14, …, c0) in base $2^{32}$; $p = 2^{256} - 2^{224} - 2^{96} + 2^{64} - 1$.
Output: c mod p
(1)s1 = (c7, c6, c5, c4, c3, c2, c1, c0), s2 = (c15, c14, c13, c12, c11, 0, c9, c8),
s3 = (c14, 0, c15, c14, c13, 0, c14, c13), s4 = (c13, 0, 0, 0, 0, 0, c15, c14),
s5 = (c12, 0, 0, 0, 0, 0, 0, c15), s6 = (c11, c11, c10, c15, c14, 0, c13, c12),
s7 = (c10, c15, c14, c13, c12, 0, c11, c10), s8 = (c9, 0, 0, c9, c8, 0, c10, c9),
s9 = (c8, 0, 0, 0, c15, 0, c12, c11), s10 = (c15, 0, 0, 0, 0, 0, 0, 0),
s11 = (0, 0, 0, 0, 0, c14, 0, 0), s12 = (0, 0, 0, 0, 0, c13, 0, 0),
s13 = (0, 0, 0, 0, 0, c9, 0, 0), s14 = (0, 0, 0, 0, 0, c8, 0, 0)
Z = s1 + s2 + 2s3 + 2s4 + 2s5 + s6 + s7 + s8 + s9 + 2s10 − s11 − s12 − s13 − s14
(2)Return Z mod p
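
The one-stage reduction can be checked numerically with the short Python sketch below. It assumes that the SCA-256 prime is p = 2^256 − 2^224 − 2^96 + 2^64 − 1 and that the terms s3 and s10 carry a coefficient of 2, which is what the congruence 2^256 ≡ 2^224 + 2^96 − 2^64 + 1 (mod p) requires word by word. The function names are hypothetical, and the loop of subtractions only models the final correction step, not the hardware in [9].

```python
# Assumed SCA-256 prime (the SM2 recommended prime).
P = 2**256 - 2**224 - 2**96 + 2**64 - 1


def pack(*ws):
    """Pack eight 32-bit words, most significant first, into one integer."""
    v = 0
    for w in ws:
        v = (v << 32) | w
    return v


def fast_reduce(c):
    """One-stage fast reduction of a 512-bit c: Z is congruent to c mod P."""
    (c0, c1, c2, c3, c4, c5, c6, c7,
     c8, c9, c10, c11, c12, c13, c14, c15) = [
        (c >> (32 * i)) & 0xFFFFFFFF for i in range(16)]
    s1 = pack(c7, c6, c5, c4, c3, c2, c1, c0)
    s2 = pack(c15, c14, c13, c12, c11, 0, c9, c8)
    s3 = pack(c14, 0, c15, c14, c13, 0, c14, c13)
    s4 = pack(c13, 0, 0, 0, 0, 0, c15, c14)
    s5 = pack(c12, 0, 0, 0, 0, 0, 0, c15)
    s6 = pack(c11, c11, c10, c15, c14, 0, c13, c12)
    s7 = pack(c10, c15, c14, c13, c12, 0, c11, c10)
    s8 = pack(c9, 0, 0, c9, c8, 0, c10, c9)
    s9 = pack(c8, 0, 0, 0, c15, 0, c12, c11)
    s10 = pack(c15, 0, 0, 0, 0, 0, 0, 0)
    s11 = pack(0, 0, 0, 0, 0, c14, 0, 0)
    s12 = pack(0, 0, 0, 0, 0, c13, 0, 0)
    s13 = pack(0, 0, 0, 0, 0, c9, 0, 0)
    s14 = pack(0, 0, 0, 0, 0, c8, 0, 0)
    return (s1 + s2 + 2 * s3 + 2 * s4 + 2 * s5 + s6 + s7 + s8 + s9
            + 2 * s10 - s11 - s12 - s13 - s14)


def traditional_mr(c):
    """Return (c mod P, number of final subtractions of P that were needed)."""
    z = fast_reduce(c)
    subs = 0
    while z >= P:          # the iterative correction the TSMR aims to avoid
        z -= P
        subs += 1
    return z, subs
```

Calling `traditional_mr(a * b)` for random field elements `a` and `b` returns the reduced product together with the number of correction subtractions, which illustrates why the width of the intermediate range matters for latency.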

A TSMR algorithm on SCA-256 is proposed in this paper to address this problem (Algorithm 3). The first stage takes sixteen addition/subtraction operations to calculate Z1, while the second takes only two to calculate Z2. The intermediate value after the two-stage fast reduction is Z2, where Z2 ∈ [0, 2p), so at most one subtraction is needed to obtain the final value Z ∈ [0, p).

Input: integers a and c = (c15, c14, …, c0) in base $2^{32}$; $p = 2^{256} - 2^{224} - 2^{96} + 2^{64} - 1$.
Output: (c − a) mod p
(1)s1 = (c7, c6, c5, c4, c3, c2, c1, c0); s2 = (c15, 0, 0, 0, 0, 0, 0, c8);
s3 = (c14, 0, 0, c14, c14, 0, c14, c14); s4 = (c13, 0, 0, 0, c13, 0, c13, c13);
s5 = (c12, 0, c15, 0, 0, 0, c15, c15); s6 = (c11, c11, c13, c13, c11, 0, c11, c11);
s7 = (c10, c15, c10, 0, 0, 0, c10, c10); s8 = (c9, c14, c14, c15, c15, 0, c9, c9);
s9 = (c8, 0, 0, c9, c8, 0, 0, 0); s10 = (0, 0, 0, c12, c12, 0, c12, c12);
 s11 = (0, 0, 0, 0, c14, c14, 0, 0); s12 = (0, 0, 0, 0, 0, c9, 0, c8);
 s13 = (0, 0, 0, 0, 0, c13, c13, 0); s14 = (0, 0, 0, 0, 0, c8, 0, c8);
Z1 = s1+ 3s2 + 2s3 + 2s4 + 2s5 + s6 + s7 + s8 + s9 +s10– s11– s12– s13 – s14 – a + p = (r8, r7, r6, r5, r4, r3, r2, r1, r0)
(2)s15 = (r7, r6, r5, r4, r3, r2, r1, r0); s16 = (r8, 0, 0, 0, r8, 0, 0, r8); s17 = (0, 0, 0, 0, 0, r8, 0, 0);
Z2 = s15 + s16 – s17;
Z3 = Z2 – p;
If Z3≥0, return Z3
 else return Z2
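
The following Python sketch mirrors the two stages of Algorithm 3 so that the word arrangement can be checked against ordinary modular arithmetic. It rests on three assumptions that should be read as such: the prime is p = 2^256 − 2^224 − 2^96 + 2^64 − 1, the second stage adds s16 and subtracts s17 (consistent with the statement that Z2 contains a single subtraction), and the operand a is subtracted, so the result is (c − a) mod p with 0 ≤ a < p. The function names are hypothetical.

```python
# Assumed SCA-256 prime (the SM2 recommended prime).
P = 2**256 - 2**224 - 2**96 + 2**64 - 1
MASK256 = (1 << 256) - 1


def pack(*ws):
    """Pack eight words, most significant first, into one integer."""
    v = 0
    for w in ws:
        v = (v << 32) | w
    return v


def tsmr(c, a=0):
    """Two-stage reduction sketch: returns (c - a) mod P for c < 2^512, a < P."""
    (c0, c1, c2, c3, c4, c5, c6, c7,
     c8, c9, c10, c11, c12, c13, c14, c15) = [
        (c >> (32 * i)) & 0xFFFFFFFF for i in range(16)]
    # Stage 1: fold the high half of c (and the pending operand a) into Z1.
    s1 = pack(c7, c6, c5, c4, c3, c2, c1, c0)
    s2 = pack(c15, 0, 0, 0, 0, 0, 0, c8)
    s3 = pack(c14, 0, 0, c14, c14, 0, c14, c14)
    s4 = pack(c13, 0, 0, 0, c13, 0, c13, c13)
    s5 = pack(c12, 0, c15, 0, 0, 0, c15, c15)
    s6 = pack(c11, c11, c13, c13, c11, 0, c11, c11)
    s7 = pack(c10, c15, c10, 0, 0, 0, c10, c10)
    s8 = pack(c9, c14, c14, c15, c15, 0, c9, c9)
    s9 = pack(c8, 0, 0, c9, c8, 0, 0, 0)
    s10 = pack(0, 0, 0, c12, c12, 0, c12, c12)
    s11 = pack(0, 0, 0, 0, c14, c14, 0, 0)
    s12 = pack(0, 0, 0, 0, 0, c9, 0, c8)
    s13 = pack(0, 0, 0, 0, 0, c13, c13, 0)
    s14 = pack(0, 0, 0, 0, 0, c8, 0, c8)
    z1 = (s1 + 3 * s2 + 2 * s3 + 2 * s4 + 2 * s5 + s6 + s7 + s8 + s9 + s10
          - s11 - s12 - s13 - s14 - a + P)
    # Stage 2: fold the small overflow word r8 back into 256 bits.
    r8 = z1 >> 256
    s15 = z1 & MASK256
    s16 = pack(r8, 0, 0, 0, r8, 0, 0, r8)
    s17 = pack(0, 0, 0, 0, 0, r8, 0, 0)
    z2 = s15 + s16 - s17
    # At most one conditional subtraction remains.
    z3 = z2 - P
    return z3 if z3 >= 0 else z2
```

For example, `tsmr(x * y, t)` returns `(x * y - t) % P` for field elements `x`, `y`, and `t`, which is the fused "multiply then subtract" pattern used by the point operations.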

In the ECADD and ECDBL operations, an MM is often immediately followed by a modular addition (MA) or modular subtraction (MS). One cycle can be saved by carrying out this MA/MS inside the reduction. The maximum delay of a carry-save addition depends only on the final carry propagation, so adding one more value to the other twenty does not have a large impact on latency. As shown in Algorithm 3, the operand a of the following MA/MS is folded into Z1. In Algorithm 5, such an operation appears twice in ECADD (Step 9: T5 = T2T2 − T4; Step 11: Y3 = T1T2 − T4) and twice in ECDBL (Step 6: X3 = T2T2 − T1; Step 8: Y3 = T1T2 − T3). The clock cycles of these separate MA/MS operations are thereby saved.

Input: A: 256-bit integer, satisfying A = a1 × 2^128 + a0
  B: 256-bit integer, satisfying B = b1 × 2^128 + b0.
Output: C: 512-bit product, satisfying C = A × B.
(1)P00 = a0 × b0; asum = a0 + a1;
(2)P11 = a1 × b1; bsum = b0 + b1;
(3)Pss = asum × bsum, C = (P11, P00) – P00 × 2^128;
(4)C = C – P11 × 2^128;
(5)C = C + Pss × 2^128;
(6)return C
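
The five steps of the KO multiplication above can be traced in software with the following Python sketch. It follows the same step order and variable names; the function name `ko_mul_256` is hypothetical, and Python's unbounded integers stand in for the 512-bit accumulator, so the temporary negative value that can occur after step (3) is harmless here, whereas a hardware accumulator would handle it in two's complement.

```python
MASK128 = (1 << 128) - 1


def ko_mul_256(A, B):
    """One-level Karatsuba-Ofman product of two 256-bit integers."""
    a1, a0 = A >> 128, A & MASK128
    b1, b0 = B >> 128, B & MASK128

    # Step 1: low half-word product and first operand sum.
    P00 = a0 * b0
    asum = a0 + a1
    # Step 2: high half-word product and second operand sum.
    P11 = a1 * b1
    bsum = b0 + b1
    # Step 3: third half-word product; start from the concatenation (P11, P00).
    Pss = asum * bsum
    C = ((P11 << 256) | P00) - (P00 << 128)
    # Step 4 and step 5: finish the middle term (Pss - P00 - P11) * 2^128.
    C = C - (P11 << 128)
    C = C + (Pss << 128)
    return C
```

Only three multiplications (`P00`, `P11`, `Pss`) are performed, compared with four for a schoolbook split into half-words.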

3.2. Carry-Save Adder Architecture

In the TSMR algorithm, there are five subtraction operations in Z1 and one in Z2. To reduce area consumption and clock latency, a new carry-save adder (CSA) architecture is presented for Algorithm 3; its main advantage is that it can handle subtraction. A subtraction becomes an addition by using the complement of the subtrahend.
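
The idea of turning subtractions into additions inside a carry-save tree can be illustrated with a generic Python sketch. This is not the architecture of Figures 2 and 3: it is a plain chain of 3:2 compressors with a fixed operand width, the function names are hypothetical, and the "+1" of each two's complement is collected into one extra constant operand.

```python
def csa(x, y, z, width):
    """3:2 carry-save compressor: x + y + z == s + c (mod 2^width)."""
    mask = (1 << width) - 1
    s = x ^ y ^ z                                  # bitwise sum, no carry chain
    c = ((x & y) | (x & z) | (y & z)) << 1         # saved carries, shifted left
    return s & mask, c & mask


def csa_sum(addends, subtrahends, width):
    """Compute (sum(addends) - sum(subtrahends)) mod 2^width with CSAs."""
    mask = (1 << width) - 1
    ops = [v & mask for v in addends]
    # Subtraction as addition: one's complement of each subtrahend ...
    ops += [(~v) & mask for v in subtrahends]
    # ... plus one "+1" per subtrahend, merged into a single constant operand.
    ops.append(len(subtrahends) & mask)
    while len(ops) > 2:
        s, c = csa(ops[0], ops[1], ops[2], width)
        ops = ops[3:] + [s, c]
    # Only this last step needs a carry-propagate adder.
    return sum(ops) & mask
```

For instance, `csa_sum([10, 20, 30], [5, 7], 261)` evaluates 60 − 12 using additions only. Because each compressor has constant depth, appending one more operand (such as the MA/MS operand a) lengthens the tree only slightly, which is the property the fused reduction relies on.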

The first-stage reduction result, Z1, is designed as 261-bit data; computing it involves 21 operands and 20 256-bit CSAs. Because the complement of each of the five subtrahends carries extended sign bits, as shown in Figure 2, the most significant bits (MSBs) of the 20 CSAs cannot simply be accumulated, and a CSA of 261 or more bits is not available. As shown in Figure 2, the MSBs of Z1 therefore cannot be obtained from the 256-bit operand sums alone. However, the 256-th to 260-th bits of each subtrahend's complement are set to 1, while the 257-th to 261-th bits of each addend are set to 0. The sum of the 256-th to 260-th bits of the five subtrahends' complements can therefore be precalculated, giving 5 × 11111₂ = 10011011₂. Only the low 5 bits (11011₂) are needed, and they can be placed in row 1 of the 1-bit CSA. In this way, the proposed CSA handles the subtraction operations.

The first-stage reduction architecture can be divided into two parts: the left part is built from 1-bit CSAs and the right part from 32-bit CSAs, as shown in Figure 3. To simplify the analysis, a 1-bit CSA (one full adder) is drawn as a thin rectangle on the left, while a 32-bit CSA, composed of 32 1-bit CSAs, is drawn as a wider rectangle on the right. A subtraction, such as the one in row 15 of the 32-bit CSA, is implemented by adding the complement of the subtrahend. The 32-bit CSA array consists of 20 rows and 8 columns, which compute the low 256 bits of Z1, (r7, …, r0). The 1-bit CSA array has 5 columns with 10, 5, 2, 1, and 1 rows, respectively, which compute the high bits r8.

The second-stage reduction computes Z2 and Z3; it has at most 4 operands and needs four 257-bit CSAs. Two of the 257-bit CSAs compute Z2, and the remaining two compute Z3.

3.3. Integer Multiplication

Most traditional high-performance architectures are based on multipliers. Because of the cost of full-word multipliers, a long multiplication is usually split into narrower multiplications spread over more operation cycles. Even though the one-cycle 256-bit multiplier in [20] achieves the fewest cycles, it also consumes the most hardware area and has the longest critical-path delay. To balance hardware consumption and performance, the KO multiplication algorithm, based on divide-and-conquer, is adopted in this paper, as shown in Algorithm 4:

$$A \times B = a_1 b_1 \cdot 2^{256} + \left[(a_1 + a_0)(b_1 + b_0) - a_1 b_1 - a_0 b_0\right] \cdot 2^{128} + a_0 b_0,$$

where $A = a_1 \cdot 2^{128} + a_0$ and $B = b_1 \cdot 2^{128} + b_0$. Compared with the cascaded 128-bit × 128-bit unsigned multipliers in [16], which use four half-word multiplications, the KO algorithm uses only three, at the cost of one extra full-word subtraction and two extra half-word additions. While the KO algorithm presented in [11] requires six cycles, the KO algorithm presented in [27], shown in Algorithm 4, requires only five cycles.

Input:P1 = (X1, Y1, Z1), P2 = (x2, y2)
Output:P3 = P1 + P2
(1)T1 = Z1Z1
(2)T2 = T1Z1
(3)T1 = T1x2
(4)T2 = T2y2, T1 = T1 − X1
(5)T3 = T1T1, T2 = T2 − Y1
(6)Z3 = Z1T1
(7)T4 = T3T1
(8)T3 = T3X1
(9)T5 = T2T2 − T4, T1 = 3T3
(10)T4 = T4Y1, T1 = T1 − T5
(11)Y3 = T1T2 − T4, X3 = T3 − T1
(12)return (X3, Y3, Z3)
Input:P1 = (X1, Y1, Z1)
Output:P3 = 2P1
(1)T1 = Z1Z1, Y3 = 2Y1
(2)T4 = Y3Y3, T2 = X1 − T1, T1 = X1 + T1
(3)T2 = T2T1
(4)T3 = T4X1, T2 = 3T2
(5)Z3 = Y3Z1, T1 = 2T3
(6)X3 = T2T2 − T1
(7)T3 = T4T4, T1 = T3 − X3
(8)T3 = T3/2, Y3 = T1T2 − T3
(9)return (X3, Y3, Z3)

3.4. Point Addition and Point Doubling

A series of ECADD and ECDBL operations make up PM. To avoid idle cycles, the ECADD and ECDBL algorithm proposed in [27] is chosen for this architecture, given as Algorithm 5. The algorithm has three advantages. Firstly, multiplication and MR are performed in parallel except in one case: the second multiplication of the point addition must wait until the first modular multiplication finishes, because one of its inputs, the multiplier T1, is the output of the first modular multiplication. Secondly, the multiplier runs continuously, whether moving from ECDBL to ECDBL or switching between ECADD and ECDBL. Thirdly, hardware consumption is minimized by using only one modular multiplication unit and two modular addition/subtraction units. The proposed high-performance architecture is displayed in Figure 4. The MA/MS unit is designed to perform multiple functions, such as T1 − X1, X1 + T1, 3T1, and Y3/2.
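
The operation sequences of Algorithm 5 can be sanity-checked in software by comparing them with textbook affine formulas. The Python sketch below does this under stated assumptions: the prime is p = 2^256 − 2^224 − 2^96 + 2^64 − 1, the curve coefficient is a = −3 (which the doubling sequence implies), and the test point (2, 3) lies on an illustrative curve whose b is derived from that point rather than on the SM2 curve itself. All function names are hypothetical, and the parallel MM and MA/MS operations of each step are simply executed one after another.

```python
# Assumed SCA-256 prime (the SM2 recommended prime).
P = 2**256 - 2**224 - 2**96 + 2**64 - 1


def half(x):
    """x / 2 mod P for odd P."""
    return x >> 1 if x % 2 == 0 else (x + P) >> 1


def ecadd(X1, Y1, Z1, x2, y2):
    """Mixed affine-Jacobian addition, step order as in Algorithm 5."""
    T1 = Z1 * Z1 % P
    T2 = T1 * Z1 % P
    T1 = T1 * x2 % P
    T2 = T2 * y2 % P; T1 = (T1 - X1) % P
    T3 = T1 * T1 % P; T2 = (T2 - Y1) % P
    Z3 = Z1 * T1 % P
    T4 = T3 * T1 % P
    T3 = T3 * X1 % P
    T5 = (T2 * T2 - T4) % P; T1 = 3 * T3 % P
    T4 = T4 * Y1 % P; T1 = (T1 - T5) % P
    Y3 = (T1 * T2 - T4) % P; X3 = (T3 - T1) % P
    return X3, Y3, Z3


def ecdbl(X1, Y1, Z1):
    """Jacobian doubling for a = -3, step order as in Algorithm 5."""
    T1 = Z1 * Z1 % P; Y3 = 2 * Y1 % P
    T4 = Y3 * Y3 % P; T2 = (X1 - T1) % P; T1 = (X1 + T1) % P
    T2 = T2 * T1 % P
    T3 = T4 * X1 % P; T2 = 3 * T2 % P
    Z3 = Y3 * Z1 % P; T1 = 2 * T3 % P
    X3 = (T2 * T2 - T1) % P
    # Step 7: the subtraction uses the old T3 while the multiplier renews T3.
    T3, T1 = T4 * T4 % P, (T3 - X3) % P
    T3 = half(T3); Y3 = (T1 * T2 - T3) % P
    return X3, Y3, Z3


def inv(x):
    return pow(x, P - 2, P)          # Fermat inversion, P is prime


def to_affine(X, Y, Z):
    zi = inv(Z)
    return X * zi * zi % P, Y * zi * zi * zi % P


def affine_dbl(x, y):
    lam = (3 * x * x - 3) * inv(2 * y) % P
    x3 = (lam * lam - 2 * x) % P
    return x3, (lam * (x - x3) - y) % P


def affine_add(x1, y1, x2, y2):
    lam = (y2 - y1) * inv(x2 - x1) % P
    x3 = (lam * lam - x1 - x2) % P
    return x3, (lam * (x1 - x3) - y1) % P


# Illustrative point; it lies on y^2 = x^3 - 3x + 7 over GF(P).
GX, GY = 2, 3
```

As a usage example, `to_affine(*ecdbl(GX, GY, 1))` should agree with `affine_dbl(GX, GY)`, and adding the affine point back with `ecadd` should agree with `affine_add`.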

4. Implementation and Validation

The architecture described above is implemented in Verilog HDL. It is synthesized using Design Compiler with the SMIC 130 nm CMOS standard cell library, and its area is reported in equivalent 2-input NAND gates. In addition, for comparison with other designs on FPGA platforms, it is also implemented on a Xilinx Virtex-6 xc6vlx760 using Xilinx ISE 14.7. The performance is obtained by ModelSim simulation. The test data conform to the ECC cryptography protocol and are randomly generated. For a hardware design, performance and hardware consumption are the two main evaluation metrics. Besides, the time-area product is used as a metric for the trade-off between performance and hardware consumption.

With the window NAF recoding method, the time to execute a point multiplication is expressed in terms of w, A, and D, where w refers to the width of the NAF, A is the number of cycles required by ECADD, and D is the number of cycles required by ECDBL. In this work, w is set to 4. The points 1P, 3P, 5P, and 7P are precalculated.

Table 1 shows the clock cycles required by each operation. For a fixed point, the PM operation uses NAF4 recoding of the scalar and takes an average of 14242 cycles over 1000 tests. After the PM operation, two MI operations are required to convert the result from Jacobian coordinates to affine coordinates.
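
As a quick arithmetic check, the average cycle count can be converted into a latency. The sketch below assumes that the 14242-cycle average and the 153.8 MHz clock frequency reported in the conclusion refer to the same configuration; the function name is hypothetical.

```python
def pm_latency_ms(cycles, freq_mhz):
    """Latency in milliseconds for a given cycle count and clock in MHz."""
    return cycles / (freq_mhz * 1e6) * 1e3


latency = pm_latency_ms(14242, 153.8)   # about 0.0926 ms
```

This is consistent with the reported 0.092 ms per PM.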

Table 2 compares the proposed design with other designs over 256-bit GF(p). The architecture in [9] uses 256-bit multipliers, so its area is large, at 659 K gates. Because it consumes so many hardware resources, it is not suitable for teleoperated robots. The architecture in [18] relies on two multiplier units using interleaved modular multiplication algorithms. Hence, it has a smaller area but worse computation efficiency. The proposed design is 32.7 times faster than that in [18]. The architecture in [22] adopts a systolic arithmetic unit and obtains a smaller area but takes more clock cycles. The AT (area-time product) of our architecture is smaller than those of [18, 22].

The design in [28] adopts projective coordinates to avoid MI and employs a radix-2 modular multiplication algorithm for MM. In [29], Shah et al. presented a high-speed processor based on redundant signed digit (RSD) arithmetic to avoid lengthy carry propagation delay. It is able to run at a high frequency of 327 MHz and requires 0.47 ms to perform a single PM operation. The architecture in [11] uses half-word multipliers based on the Barrett modular multiplication algorithm. In [19], a unified architecture for computing MA, MS, and MM is proposed. The designs in [30, 31] use only adders, which results in worse performance than ours. Both adopt the radix-4 Booth-encoded interleaved modular multiplication algorithm. The NAF point multiplication algorithm is applied in [31], while the double-and-always-add point multiplication algorithm is employed in [30]. Because NAF2 recoding reduces the number of point additions in PM, the design in [30] needs more LUTs to reach a clock cycle count comparable to that of [31] on the same platform. The architecture proposed here needs fewer clock cycles and performs point multiplication faster than the architectures in [11, 18, 19, 21, 30, 31].

Security is one of the most important issues in teleoperated robotic systems. In harsh conditions, time delay and power consumption are important, so realizing cryptographic algorithms in hardware has become a clear trend. The ECC processor proposed here is implemented in hardware and provides high performance. The most complicated operations, such as PM, PA, and the modular operations, are implemented by this hardware, and the hardware module can be called by software to realize digital signature/verification and encryption/decryption, addressing the safety issue of teleoperated systems.

5. Conclusion

In a teleoperated system, robots interact with and are controlled by human operators through a communication network. Therefore, security becomes an important issue, and ECC is a good choice among the available cryptographic algorithms because of its shorter key length. In this work, a high-performance ECC architecture for SM2 is proposed, which is suitable for securing teleoperated robots. To reduce the latency caused by iterated subtractions, a TSMR algorithm on SCA-256 is presented and implemented with a carry-save adder architecture that also handles subtraction. As a result, the intermediate result lies in [0, 2p), compared with [0, 14p) for the traditional algorithm. To balance area and performance, a half-word multiplier is adopted, together with a pipelined design that fully exploits the parallelism of the calculation. The experimental results show that the proposed design takes 0.092 ms to perform a 256-bit PM at a frequency of 153.8 MHz and occupies 341.98 k gates. Furthermore, the implementation results indicate that the proposed architecture has better performance and a smaller AT than previous works.

In the future, the optimization of modular multiplication will be studied to further reduce the hardware overhead. The portability of the hardware modules and software-hardware codesign will also be studied to extend the application fields. Attack-resistance techniques are another topic worth studying.

Data Availability

The raw/processed data required to reproduce these findings cannot be shared at this time as the data also form part of an ongoing study.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Independent High Safe Security Chip Research for the Metering and Electricity Program under Grant no. ZBKJXM20180014/SEPRI-K185011, and this support is acknowledged by the authors.