Abstract

Crystals-Dilithium is one of the digital-signature algorithms in the final round of NIST’s ongoing post-quantum cryptography (PQC) standardization. Security and computational efficiency of software and hardware implementations are the primary criteria for PQC standardization. Many studies have sought to apply Dilithium efficiently in various environments; however, they focus on conventional PCs and 32-bit Advanced RISC Machine (ARM) processors (Cortex-M4). ARMv8-based processors are more advanced embedded microcontrollers (MCUs) and are widely used in IoT devices, edge computing devices, and On-Board Units in autonomous driving cars. In this study, we present an efficient Crystals-Dilithium implementation on an ARMv8-based MCU. To enhance Dilithium’s performance, we optimize number theoretic transform (NTT)-based polynomial multiplication, the core operation of Dilithium, by leveraging ARMv8’s architectural properties such as its large register set and the NEON engine. We apply task parallelism to NTT-based polynomial multiplication using the NEON engine. In addition, we reduce the number of memory accesses during NTT-based polynomial multiplication with the proposed merging and register-holding techniques. Finally, we present an interleaved NTT-based multiplication executed simultaneously on the ARM processor and the NEON engine. This implementation further improves performance by hiding ARM processor latency behind NEON overheads. Through the proposed optimization methods, for Dilithium 3, we achieved a performance improvement of about 43.83% in key pair generation, 113.25% in signing, and 41.92% in verification compared to the reference implementation submitted to the final round of the NIST PQC competition.

1. Introduction

In the communication network field, sensor nodes and devices use cryptographic protocol with digital-signature and key-exchange algorithms for integrity and confidentiality [1, 2]. With the development of technology, the number of sensor nodes has increased dramatically compared to the past, and accordingly, research on the definition of security standards and vulnerabilities for the increased nodes has been actively conducted [25]. In addition, various applications that use encryption systems have emerged in relation to privacy protection, such as image encryption [6, 7].

However, when Google developed a 72-qubit quantum computer, a fatal issue arose for existing cryptographic systems: in a quantum environment, public-key cryptosystems based on integer factorization and discrete logarithms can be broken in polynomial time using the Shor algorithm [8]. Recognizing this issue, NIST launched the post-quantum cryptography standardization for key encapsulation mechanisms (KEMs) and digital signatures in 2016 to replace the international standard public-key cryptography. Candidates for the final round of the PQC standardization were recently announced. The KEM algorithms are Crystals-Kyber, SABER, NTRU, and Classic McEliece, and the digital signatures are Rainbow, Falcon, and Crystals-Dilithium. Except for Classic McEliece and Rainbow, all finalists use lattice-based cryptography.

Among the digital-signature algorithms, the multivariate-based Rainbow has fast signature generation and verification and a very short signature length [9]. However, due to the high key generation cost caused by the huge key size of 10 KB or more, mounting Rainbow on constrained embedded devices is challenging. Furthermore, Rainbow’s security is being reconsidered in light of the intersection and rectangular MinRank attacks recently proposed in [10].

Falcon, based on NTRU lattices, is compact and efficient. Compared to other digital-signature algorithms proposed in the PQC competition, it has the shortest key length and the highest verification speed [11]. However, Falcon has a high key generation cost because it has to solve the NTRU equation. Also, because it relies on floating-point operations, it performs poorly on embedded devices that do not support them.

Crystals-Dilithium is a lattice-based algorithm that relies on the hardness of the Learning With Errors (LWE) problem [12]. In addition, compared to other digital-signature algorithms, its key generation, signature generation, and signature verification performance are well balanced. Specifically, in the 2nd Round of the PQC competition, a method for implementing the algorithm more efficiently was proposed, and the security analysis in the quantum random oracle model (QROM) applies well to Crystals-Dilithium, making it the most promising candidate among the finalists [13]. However, the cost at the application layer must be considered when PQC-DSA is applied to secure protocols and block-chain systems. Therefore, optimization of the NTT-based polynomial multiplication algorithm in Crystals-Dilithium is essential.

Optimization studies for PQC in various environments have been conducted. Efforts have been made to mount KEM and digital signatures on platforms ranging from Advanced RISC Machine (ARM)-based MCUs to CPU and GPU environments. In general, because PQC algorithms have longer keys and signatures than ECDSA, research into mounting them on constrained devices is important for future applicability. ARM cores, which are the most widely used in the embedded environment, are used in a variety of boards, depending on their performance. Since the ARM-Cortex-M4 using ARMv7 was chosen by NIST as one of the performance evaluation platforms of the PQC competition, various optimization studies, including the PQM4 project, have been conducted on ARMv7-based equipment [14–16]. Similarly, research on PQC implementation in the ARMv8 environment is ongoing. On ARMv8-based devices, optimization studies have been carried out on algorithms such as Newhope (a Round 2 KEM algorithm), SABER, and Kyber [17–19].

ARMv8 is a key architecture in the Internet of Things (IoT) society, serving as a core MCU for high-end computers in addition to mobile, tablet, and desktop computers. As a result, the use of ARMv8-based boards is expected to increase in the future. The Jetson Xavier with an ARMv8.2 core is our study’s target device. Jetson Xavier is currently in use for a variety of IoT/Cloud platforms, including autonomous driving environments that require digital signatures. Therefore, in this study, we present an optimized Crystals-Dilithium implementation in the ARMv8 environment. We propose the parallel logic of the NTT-based polynomial multiplication algorithm, which is the core operation of Crystals-Dilithium, by fully utilizing the ARM processor and NEON engine.

1.1. Contribution
1.1.1. First Work of Crystals-Dilithium on ARMv8

Until now, an official Crystals-Dilithium study has only been conducted in ARM-Cortex-M3 and ARM-Cortex-M4 [15], but optimization studies on more diverse platforms are needed before it can be used in real-world applications. The ARMv8-A series, in particular, is being developed not only as a core MCU for mobiles and tablets, but also as an MCU for autonomous driving and high-end computers. Since the ARMv8-A series is becoming more popular as a core MCU in the embedded industry, optimization studies of PQC-based digital signatures in the ARMv8-A series should be considered. For the first time, we discuss in-depth optimization of NTT multiplication, the core operation of Crystals-Dilithium in the ARMv8-A series. As a result of the proposed methods, our Crystals-Dilithium software improved its performance by 43.83% in KeyGen, 113.25% in Sign, and 41.92% in Verify when compared to previous research [13] based on Crystals-Dilithium level 3.

1.1.2. Proposing Memory Optimization Techniques

Memory access not only incurs a high performance overhead in an embedded environment, but its instructions are also expensive. As a result, our goal is to reduce the number of memory accesses. We present an optimal path through the NTT that minimizes memory access. Using the merging method, memory access was minimized from Depth 0 to 2 in the NTT, which reduced memory access instructions (loads and stores) by about 32 when compared to the standard implementation. Furthermore, for Depths 2–7 in the NTT, all coefficients required for conversion are stored in NEON vector registers and held until the NTT is completed. Because all coefficients are held in vector registers, the register-holding technique avoids memory access. We reduce the number of memory accesses that can occur in the NTT by using the proposed memory optimization techniques.

1.1.3. Optimizing NTTs for ARMv8

We present a method for designing the NTT multiplication of Crystals-Dilithium considering the resources of the ARMv8-A series. Our target device has ARM processor modules and a NEON engine, and we optimize NTT multiplication leveraging these features. We present NEON-based and ARM-based butterfly methods for the NTT and inverse NTT. The NEON-based butterfly method uses Advanced Single Instruction Multiple Data (ASIMD) instructions and a vector register to process four coefficients in parallel. The ARM-based butterfly method employs a barrel shifter to process two coefficients. ARM processors are not as powerful as the NEON engine, but they are adequate for small tasks. Finally, we combine all of the butterfly method implementations. This software hides the latencies of ARM operations behind NEON overheads, improving performance even further. The same optimization technique is used in point-wise multiplication. We achieved performance improvements of 251% in NTT, 20% in point-wise multiplication, and 304% in inverse NTT using the proposed methods, and NTT multiplication overall achieved a 260% performance improvement compared to previous work [13].

1.2. The Necessity for PQC-DSA in the 5G Communication Network

As social networks based on the 5G industry develop, so does the importance of communication security and personal information protection. Social network websites and applications are actively used in the user’s closest space through IoT devices. For the security of communications in SocialNet-oriented cyberspace, we consider three things in this article.

The first is to minimize the load of the cryptographic algorithm. Users want more rapid responses; however, the cost of using a cryptographic system is fixed. Therefore, it is important to optimize the key-exchange and digital-signature algorithms used for cryptographic protocols. The second consideration is the IoT devices used in SocialNet-oriented cyberspace. In line with the first consideration, we implement cryptographic algorithms to match the characteristics of IoT devices. Optimization methods that take into account the characteristics of the device further accelerate the cryptographic algorithm. Finally, we consider the security of future-oriented communications. Currently, the 5G industry is accelerating, and various countries have started to develop the 6G industry. In addition, the PQC algorithm must be mounted on the cryptographic protocol to address the threat of quantum computing.

Therefore, in this article, we propose an optimized implementation of the PQC-DSA algorithm Crystals-Dilithium on ARMv8, one of the most popular platforms. Our research helps accelerate the mobile Internet and social networks.

1.3. Organization

The rest of the study is organized as follows. Section 2 introduces Crystals-Dilithium and the profiling of the reference code, as well as a description of the ARMv8 platforms. Section 3 discusses and analyzes existing implementation research for PQC. Section 4 presents an optimized NTT implementation on the ARMv8-A series. Section 5 evaluates our work. Finally, in Section 6, we conclude this study.

Algorithm 1: Key generation (KeyGen) of Crystals-Dilithium.
Algorithm 2: Signing (Sign) of Crystals-Dilithium.
Algorithm 3: Verification (Verify) of Crystals-Dilithium.

2. Preliminaries

2.1. Crystals-Dilithium

Crystals-Dilithium is one of the most promising digital-signature candidates in the final round of the NIST PQC standardization. Crystals-Dilithium is based on the hardness of the Module Learning With Errors (Module-LWE) problem and shares basic characteristics and structure with Crystals-Kyber [12, 13]. Crystals-Dilithium employs the Fiat–Shamir with aborts method and builds on Module-LWE; as a result, it provides a higher level of security than other ring-LWE-based ciphers. Furthermore, for all security levels, Crystals-Dilithium employs the same ring and modulus. This gives it an implementation advantage over other competitors.

The polynomial ring used by Dilithium is $R_q = \mathbb{Z}_q[X]/(X^{256} + 1)$, where $q$ is $2^{23} - 2^{13} + 1 = 8380417$, and the parameters are maintained by simply changing the dimension of the public matrix according to the security level. Therefore, the core operations of Crystals-Dilithium are the generation of the public matrix A and the polynomial multiplication that builds the LWE-based problem. Similar to general digital-signature algorithms, Crystals-Dilithium consists of the KeyGen, Sign, and Verify processes.

The KeyGen process of Crystals-Dilithium is depicted in Algorithm 1. Using random seeds, this process generates the public matrix A as well as the secret vectors $s_1$ and $s_2$. Through SHAKE-based rejection sampling, the sampling operation extracts numbers in a very small range. In all algorithms, the SHAKE algorithm serves as the collision-resistant hash.

Algorithm 2 depicts the Sign process of Crystals-Dilithium. Because the public matrix in Crystals-Dilithium is larger than 1 KB, it is more efficient to regenerate it from its seed during the Sign process. During signing, a masking vector $y$ of polynomials is generated, and $w = Ay$ is calculated. A challenge is then generated by hashing the message with $w_1$, the high-order bits of $w$. In the Verify process, the hint carried in the signature is used to reconstruct these high-order bits. The public key can be reduced by about 2.5 times using this method, at the cost of a slight increase in signature size.

Algorithm 3 depicts the Verify process of Crystals-Dilithium. The signature verifier determines whether the challenge is accepted and whether the signature is within the acceptable range.

2.2. Core Operation
2.2.1. NTT-Based Multiplication

The number theoretic transform (NTT) is a variant of the fast Fourier transform used by many of the lattice-based cryptographic algorithms that advanced to Round 3 of the PQC competition [20]. NTT’s main feature is that it reduces the complexity of polynomial multiplication from $O(n^2)$ to $O(n \log n)$. Using a divide-and-conquer algorithm, the NTT splits polynomials into the smallest units through the $2n$-th roots of unity, so that point-wise multiplication costs only $O(n)$. Finally, the inverse NTT transforms the result back into the coefficient representation with $O(n \log n)$ complexity. The NTT can be used in a polynomial ring $\mathbb{Z}_q[X]/(X^n + 1)$ provided that $n$ is a power of 2 and $q$ is congruent to 1 modulo $2n$. Based on the NTT, the formula for polynomial multiplication is

$$a \cdot b = \mathrm{NTT}^{-1}\big(\mathrm{NTT}(a) \circ \mathrm{NTT}(b)\big),$$

where $\circ$ is the point-wise multiplication of the coefficients.
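The multiplication formula can be checked on a toy instance. The sketch below is an illustration only: it uses the Dilithium modulus but a small degree (n = 16), it evaluates the transform naively in quadratic time rather than with fast butterflies, and the helper names (`find_psi`, `multiply_ntt`, `multiply_schoolbook`) are hypothetical and not taken from the reference code.

```python
import random

Q = 8380417   # Dilithium modulus, 2**23 - 2**13 + 1
N = 16        # toy degree for speed (Dilithium itself uses n = 256)

def find_psi(n, q):
    """Return a primitive 2n-th root of unity mod q, i.e. psi**n == -1."""
    assert (q - 1) % (2 * n) == 0, "requires q = 1 mod 2n"
    for g in range(2, q):
        psi = pow(g, (q - 1) // (2 * n), q)
        if pow(psi, n, q) == q - 1:
            return psi
    raise ValueError("no primitive root found")

def ntt(a, psi, q):
    """Naive transform: evaluate a at psi**(2i+1), the roots of X**n + 1."""
    n = len(a)
    return [sum(a[j] * pow(psi, (2 * i + 1) * j, q) for j in range(n)) % q
            for i in range(n)]

def intt(a_hat, psi, q):
    """Naive inverse transform back to the coefficient representation."""
    n = len(a_hat)
    psi_inv, n_inv = pow(psi, -1, q), pow(n, -1, q)
    return [n_inv * sum(a_hat[i] * pow(psi_inv, (2 * i + 1) * j, q)
                        for i in range(n)) % q
            for j in range(n)]

def multiply_ntt(a, b, psi, q):
    """a * b = INTT(NTT(a) o NTT(b)) with o the point-wise product."""
    c_hat = [x * y % q for x, y in zip(ntt(a, psi, q), ntt(b, psi, q))]
    return intt(c_hat, psi, q)

def multiply_schoolbook(a, b, q):
    """Reference product in Z_q[X]/(X**n + 1), using X**n = -1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c

random.seed(0)
psi = find_psi(N, Q)
a = [random.randrange(Q) for _ in range(N)]
b = [random.randrange(Q) for _ in range(N)]
c_ntt = multiply_ntt(a, b, psi, Q)
c_ref = multiply_schoolbook(a, b, Q)
```

Both products agree coefficient by coefficient, which is the property the formula expresses; the speedup only appears once the naive transforms are replaced by layered butterflies.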

The depth of the NTT, at most $\log_2 n$, varies depending on the polynomial ring used. In Crystals-Kyber, which belongs to the same Crystals family, the size of $q$ means that first-order (degree-1) polynomial multiplication must be performed during point-wise multiplication. However, in the case of Crystals-Dilithium, $q$ is $2^{23} - 2^{13} + 1$ and $n$ is 256; hence, the NTT operates on 32-bit coefficients. Finally, because Crystals-Dilithium can be transformed through all 8 depths ($\log_2 256 = 8$), point-wise multiplication consists of 256 32-bit multiplications.

There are numerous methods for computing the NTT. Crystals-Dilithium employs the bit-reverse-based Cooley–Tukey [21] and Gentleman–Sande [22] algorithms. The butterfly method is used to match the two transform methods, and because both algorithms are bit-reverse based, no additional bit-reversal step is needed.
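As a rough illustration of how the two butterfly types pair up, the following sketch implements a Cooley–Tukey forward transform and a Gentleman–Sande inverse with twiddle factors stored in bit-reversed order. It uses plain modular arithmetic instead of Montgomery form, searches for a primitive 512-th root of unity rather than hard-coding one, and all function names are hypothetical, so it should be read as a model of the structure and not as the reference code.

```python
import random

Q = 8380417   # Dilithium modulus
N = 256       # Dilithium polynomial length

def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def make_zetas(n=N, q=Q):
    """Twiddle table zetas[k] = psi**bitrev(k) for a primitive 2n-th root psi."""
    psi = next(p for p in (pow(g, (q - 1) // (2 * n), q) for g in range(2, q))
               if pow(p, n, q) == q - 1)
    bits = n.bit_length() - 1
    return [pow(psi, bit_reverse(k, bits), q) for k in range(n)]

def ntt(a, zetas, q=Q):
    """Forward NTT with Cooley-Tukey butterflies (output in bit-reversed order)."""
    a, n, k, length = list(a), len(a), 0, len(a) // 2
    while length >= 1:
        for start in range(0, n, 2 * length):
            k += 1
            zeta = zetas[k]
            for j in range(start, start + length):
                t = zeta * a[j + length] % q
                a[j + length] = (a[j] - t) % q
                a[j] = (a[j] + t) % q
        length //= 2
    return a

def inv_ntt(a, zetas, q=Q):
    """Inverse NTT with Gentleman-Sande butterflies, undoing ntt() layer by layer."""
    a, n, k, length = list(a), len(a), len(a), 1
    while length < n:
        for start in range(0, n, 2 * length):
            k -= 1
            zeta = -zetas[k]          # equals the inverse of the forward twiddle
            for j in range(start, start + length):
                t = a[j]
                a[j] = (t + a[j + length]) % q
                a[j + length] = (t - a[j + length]) * zeta % q
        length *= 2
    n_inv = pow(n, -1, q)
    return [x * n_inv % q for x in a]

def schoolbook(a, b, q=Q):
    """Reference product in Z_q[X]/(X**n + 1)."""
    n, c = len(a), [0] * len(a)
    for i in range(n):
        for j in range(n):
            s, idx = (1, i + j) if i + j < n else (-1, i + j - n)
            c[idx] = (c[idx] + s * a[i] * b[j]) % q
    return c

random.seed(0)
ZETAS = make_zetas()
a = [random.randrange(Q) for _ in range(N)]
b = [random.randrange(Q) for _ in range(N)]
c = inv_ntt([x * y % Q for x, y in zip(ntt(a, ZETAS), ntt(b, ZETAS))], ZETAS)
```

Because the forward transform leaves its output in bit-reversed order and the inverse consumes exactly that order, the point-wise product can be taken directly between the two, with no reordering pass.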

2.2.2. Profiling of Crystals-Dilithium Reference Codes

In this section, we profile the code of the final Crystals-Dilithium submission and discuss the optimization strategy of this study. The reference code is compiled on the Jetson Xavier with ARMv8.2, our target platform. Although the submission contains both reference and optimized implementations, the AVX2-based optimized code cannot be built in the ARMv8 environment; additionally, as far as we know, the Crystals-Dilithium development team has not implemented code for the ARMv8-A series. As a result, the reference code submitted to the finals is the best available baseline.

Table 1 shows the profiling results of the Crystals-Dilithium reference code on Jetson Xavier. The Crystals-Dilithium algorithm is made up of three parts: KeyGen, Sign, and Verify. The sampling operation performs rejection sampling based on SHA-3 and is common to every Dilithium component. Because the Keccak algorithm is used repeatedly in rejection sampling, it requires a lot of computational resources. According to our findings, this operation accounted for 46.4%, 29.8%, and 45.4% of the computational load in the KeyGen, Sign, and Verify processes, respectively. As a result, an optimization method capable of efficiently parallelizing rejection sampling is required. Optimization studies for the Keccak algorithm in embedded devices exist [25], but studies on optimal implementations in ARMv8 environments do not exist, to the best of our knowledge. In order to reduce the performance load of this operation in the ARMv8 environment, it is recommended to use the fully assembled XKCP library made by the Keccak development team. This study does not take this into account.
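To make the cost of this step concrete, the sketch below models SHAKE-based rejection sampling of uniform coefficients: three output bytes are read at a time, masked to 23 bits, and accepted only if the value is below q. It is loosely modeled on the uniform sampler of the reference code; the way the seed and nonce are concatenated here, and the function name `rej_uniform_sketch`, are assumptions made for illustration.

```python
import hashlib

Q = 8380417   # Dilithium modulus
N = 256       # coefficients per polynomial

def rej_uniform_sketch(seed: bytes, nonce: int, n: int = N, q: int = Q):
    """Sample n coefficients in [0, q) from a SHAKE-128 stream by rejection."""
    stream_len = 3 * n
    while True:
        # SHAKE is an extendable-output function, so asking for a longer
        # digest only appends bytes to the previously seen prefix.
        stream = hashlib.shake_128(seed + nonce.to_bytes(2, "little")).digest(stream_len)
        coeffs = []
        for i in range(0, stream_len, 3):
            t = int.from_bytes(stream[i:i + 3], "little") & 0x7FFFFF  # 23 bits
            if t < q:                      # reject values outside [0, q)
                coeffs.append(t)
                if len(coeffs) == n:
                    return coeffs
        stream_len *= 2                    # too many rejections: squeeze more

seed = bytes(range(32))                    # example seed, not a real key seed
poly0 = rej_uniform_sketch(seed, 0)
poly1 = rej_uniform_sketch(seed, 1)
```

Every polynomial of the public matrix needs its own stream of this kind, which is consistent with the large share of run time that the profiling attributes to Keccak.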

Aside from the sampling operation, the NTT and the point-wise multiplication process consume the most computational resources in Crystals-Dilithium. They are responsible for 23.5%, 65.7%, and 50.4% of the KeyGen, Sign, and Verify processes, respectively. Although NTT-based multiplication is the fastest polynomial multiplication method, the public matrix has a maximum dimension of (8, 7), so the number of multiplications still incurs a performance load.

Accordingly, on Jetson Xavier, the Crystals-Dilithium implementation logic must be redesigned with its registers and instruction sets in mind. As much data as possible should be kept in the limited number of registers while operations are performed, and memory access cycles should be kept as short as possible. Furthermore, because some embedded processors support parallel instruction sets, this must be considered when distributing the workload. In Section 4, we propose optimization methods for Crystals-Dilithium in the ARMv8 environment.

2.3. Target Devices: ARMv8-A Processor

ARM is widely used in the embedded industry due to its low power consumption and high performance when compared to previous low-end processors, AVR and MSP. According to their performance, ARM processors are classified into M-series, R-series, and A series levels. Among them, the ARM-A series provides the best performance. Furthermore, the most recent version of an ARM processor is the ARMv8 architecture.

The ARMv8-A series includes an ARM processor as well as a NEON engine. Unlike the NEON engine, the ARM processor does not support parallel processing, but it is adequate for small tasks. Furthermore, the ARM processor includes a barrel shifter, which can hide the clock cycles of shift operations applied to an operand, making it a very powerful feature. The ARM processor’s register file is made up of 64-bit general-purpose registers x0–x30, and an A64 instruction set architecture [23] is provided. The NEON engine is a powerful parallel processing engine that supports 128-bit vector registers v0–v31 and the ASIMD instruction set architecture [24]. Within a 128-bit vector register, this parallel processing can be done in 64-bit, 32-bit, 16-bit, and 8-bit units.

Furthermore, the ARM processor and NEON engine are separate modules that operate independently of one another. In other words, when ARM and NEON instructions are issued sequentially, the total execution time is the sum of the ARM and NEON execution times, whereas with the interleaving approach the pipeline stalls of each instruction can be hidden and performance optimized efficiently [26]. Table 2 describes the ARM/NEON instructions and clock cycles used in this study to optimize NTT multiplication.

3. Related Work

Since the proposal of Crystals-Dilithium, implementation studies on Crystals-Dilithium in various embedded environments have been conducted. Submissions of current quantum-resistant cryptography implementations are mostly done on a CPU. The implementations used Intel instructions or the AVX2 parallel processing instruction. In the CPU environment, quantum-resistant cryptography implementations show no significant difference or slower performance than the elliptic curve-cryptographic system.

The environment in which the actual encryption equipment is used, on the other hand, is primarily comprised of low-spec embedded equipment. Because these devices have limited flash memory, RAM, and operation speed, it is critical to investigate the optimization of quantum-resistant cryptography’s core operations. There are implementation results for the ARM-Cortex-M4 targeted by the NIST software performance evaluation model. An optimization study for Crystals-Dilithium in ARM-Cortex M3 and M4 environments was proposed at CHES’2021 [15]. By converting the unsigned expression to the signed expression in the M4 environment, the additional operation that prevents negative representation from appearing in the positive representation is omitted. It was also implemented by combining two NTT layers and maximizing SIMD instructions to fit the M3/M4 environment. The NTT process improved performance by reducing data access to a bare minimum through the integration of two layers. Finally, [15] presented three implementation strategies based on public and secret information storage space.

Apart from Crystals-Dilithium, PQC optimization studies in the ARMv8 environment have been conducted on Newhope, Crystals-Kyber, and SABER. An optimized implementation of Newhope in the ARMv8 environment was presented in 2017 [17]. Although Newhope did not advance to the final round, its ARMv8-based parallel NTT multiplication implementation is relevant to our research. To reduce NEON instructions and computational division, an unsigned 16-bit integer representation is used. Furthermore, the parallel logic is newly designed so that no conversion to the Montgomery domain is required, and the existing Barrett-reduction load was removed by performing subtraction during point-wise multiplication. Through this, [17] achieved an 8.3 times performance improvement over the existing C-based reference implementation on the ARM-Cortex-A53 core.

Crystals-Kyber implementation in the ARMv8 environment has recently been proposed [18]. The NTT operation for 16-bit coefficients is optimized in the same way that Newhope is. Because the modulus q is different, the techniques used in Newhope cannot be used. Vectorization is used to optimize almost all core operations, including sampling and reduction operations. It also accelerates Crystals-Kyber’s core work of symmetric functions via ARMv8 cryptographic extensions. [18] achieved a performance improvement of up to 8.6 times over the reference code through this optimization study.

Chung et al. [19] proposed a method for applying the NTT to the SABER polynomial ring of power of 2. This resulted in a performance improvement of about 60% in SABER when compared to Toom–Cook-based multiplication. Benchmarking and research of SABER using NEON in the ARM-Cortex A series was carried out in [27]. The NTT technique and the NEON instruction proposed in [19] were used, and benchmarking was performed on the Apple M1 core and the ARM-Cortex A72.

4. Optimization Strategies of NTT

In this section, we present an optimized method of NTT multiplication, the core operation of Crystals-Dilithium, to accelerate signature processing in the ARMv8-A series. Because NTT multiplication is divided into NTT, inverse NTT, and point-wise multiplication, we introduce detailed optimization strategies by categorizing it as NTT/InvNTT and point-wise multiplication. We present a memory optimization technique and a parallel optimization method in NTT and inverse NTT. Furthermore, we present the interleaving concept of the butterfly method, which was codesigned with the ARM/NEON processor. In the point-wise multiplication, the same optimized methods used in NTT and inverse NTTs are used except for the memory optimization.

4.1. NTT and Inverse NTT

The most computationally expensive parts of NTT multiplication are the NTT and inverse NTTs. Due to the limited resources of the embedded environment, the amount of memory access varies depending on how it is implemented, and memory access is an expensive instruction in the embedded environment. As a result, we present memory optimization techniques, such as merging and register-holding, to reduce these memory accesses. In addition, we present an efficient parallel implementation using the NEON engine. Finally, by utilizing the target devices’ independent cores, we further optimize our parallel implementation using the butterfly method, which was codesigned with the ARM/NEON processor. Figure 1 depicts the overall structure of the proposed optimization techniques for the NTT. As a result, not only the memory access was minimized through the merging and register-holding method, but also the performance was further enhanced by processing the NTT of some coefficients by the ARM processor, concealing the latencies of some ARM operations with NEON overheads.

4.1.1. Memory Optimization: Merging and Holding

Figure 2 shows a comparison of the standard and merging implementations. At each depth, the standard implementation performs the butterfly method sequentially. Given the target device’s resources, this implementation requires load and store instructions at every depth. Specifically, 48 load and store instructions are required to convert the 256-degree polynomial to the 64-degree polynomial. Because these memory access instructions are very expensive in the embedded environment, memory accesses must be reduced to minimize costs, which is the primary goal of the merging implementation. The merging implementation performs the NTT of one depth without memory access by concurrently processing, in a specific order, the coefficients required for the butterfly method of the next depth. Many load and store instructions are saved as a result. Since the 256-degree polynomial is transformed without memory access up until the 64-degree polynomial, 16 load and store instructions are saved using the merging method. As a result, in an embedded environment, this merging method is an efficient way to reduce memory access overheads. Following that, the register-holding method is used to minimize memory accesses until the NTT is completed. In other words, all coefficients of the 64-degree polynomial are stored in vector registers, and operations are performed by holding them in the registers without accessing memory until the NTT is completed.
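The effect of merging can be illustrated with a small model that counts coefficient-level loads and stores. The sketch below compares two NTT depths executed one after another against the same two depths merged so that four coefficients are loaded once, passed through both butterflies, and stored once. The counts are per coefficient in this toy model and do not correspond to the instruction counts quoted above; the twiddle values and all names are placeholders chosen for illustration.

```python
Q = 8380417

class CountingMemory:
    """Array wrapper that counts every coefficient load and store."""
    def __init__(self, data):
        self.data, self.loads, self.stores = list(data), 0, 0
    def load(self, i):
        self.loads += 1
        return self.data[i]
    def store(self, i, value):
        self.stores += 1
        self.data[i] = value

def butterfly(u, v, zeta):
    """Cooley-Tukey butterfly: (u, v) -> (u + zeta*v, u - zeta*v) mod Q."""
    t = zeta * v % Q
    return (u + t) % Q, (u - t) % Q

def two_depths_standard(mem, z1, z2, z3):
    """Depth 0 over the whole array, then depth 1, each with its own memory pass."""
    n = len(mem.data)
    half, quarter = n // 2, n // 4
    for j in range(half):
        u, v = butterfly(mem.load(j), mem.load(j + half), z1)
        mem.store(j, u)
        mem.store(j + half, v)
    for start, zeta in ((0, z2), (half, z3)):
        for j in range(start, start + quarter):
            u, v = butterfly(mem.load(j), mem.load(j + quarter), zeta)
            mem.store(j, u)
            mem.store(j + quarter, v)

def two_depths_merged(mem, z1, z2, z3):
    """Load four coefficients once, apply both depths in registers, store once."""
    quarter = len(mem.data) // 4
    for j in range(quarter):
        a0, a1, a2, a3 = (mem.load(j + k * quarter) for k in range(4))
        a0, a2 = butterfly(a0, a2, z1)   # depth 0
        a1, a3 = butterfly(a1, a3, z1)
        a0, a1 = butterfly(a0, a1, z2)   # depth 1, first half
        a2, a3 = butterfly(a2, a3, z3)   # depth 1, second half
        for k, value in enumerate((a0, a1, a2, a3)):
            mem.store(j + k * quarter, value)

coeffs = [(7 * i + 3) % Q for i in range(256)]
z1, z2, z3 = 1234567, 7654321, 4242424   # placeholder twiddle factors
standard, merged = CountingMemory(coeffs), CountingMemory(coeffs)
two_depths_standard(standard, z1, z2, z3)
two_depths_merged(merged, z1, z2, z3)
```

In this model the merged version produces the same coefficients while touching memory half as often; holding the later depths in vector registers extends the same idea to the rest of the transform.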

4.1.2. Butterfly Method on the ARM Processor

ARM processors support a barrel shifter and 64-bit general-purpose registers. Furthermore, due to backward compatibility with the previous version, the lower half of a 64-bit register can be used as a 32-bit general-purpose register. Using these features, we describe the ARM-based butterfly method for processing two coefficients. Algorithm 4 shows how the proposed ARM-based butterfly method works with two coefficients.

Step 4, the first operation of the butterfly, is a signed multiplication of one input and Zetas, where Zetas is the twiddle factor of the NTT. Because a signed multiplication was performed, a Montgomery reduction is required to return the product to an element of the ring. Steps 5–7 are the proposed Montgomery reduction on the ARM processor, in which the multiplication and subtraction required for Montgomery reduction are processed at the same time using a single multiply-subtract instruction. Steps 8–9 perform the remaining addition and subtraction of the butterfly operation, which completes the butterfly for one coefficient of each input.
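The arithmetic of one such butterfly can be sketched as follows. This is a scalar model of signed Montgomery reduction with R = 2^32 and of the butterfly built on it, written in the style of the Dilithium reference code; it does not reproduce the ARM instruction sequence of Algorithm 4, and the twiddle value 1753 as well as the function names are example choices.

```python
Q = 8380417
QINV = pow(Q, -1, 1 << 32)     # q^-1 mod 2^32, computed rather than hard-coded

def to_int32(x):
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >> 31 else x

def montgomery_reduce(a):
    """Return r with r * 2^32 = a (mod Q) and |r| < Q, for |a| < 2^31 * Q."""
    t = to_int32(a * QINV)     # low 32 bits of a * q^-1, taken as signed
    return (a - t * Q) >> 32   # a - t*Q is an exact multiple of 2^32

def butterfly(a, b, zeta_mont):
    """One butterfly; zeta_mont is the twiddle in Montgomery form."""
    t = montgomery_reduce(zeta_mont * b)   # signed multiply, then reduce
    return a + t, a - t                    # addition and subtraction

zeta = 1753                                # example twiddle value
zeta_mont = zeta * (1 << 32) % Q           # same value in Montgomery form
a, b = 1234567, -7654321
out1, out2 = butterfly(a, b, zeta_mont)
```

The subtraction `a - t * Q` inside the reduction is the multiply-and-subtract pair that a fused instruction can perform in one step.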

To prepare for the next butterfly method, the upper 32-bit is shifted to the lower 32-bit using the barrel shifter in steps 11–12. The ARM-based butterfly method described above does the same thing in the following step to process one coefficient from each remaining input. Finally, to reduce memory access, we concatenate two coefficients on which the butterfly method is performed into a 64-bit general-purpose register, and the concatenated process can be simply performed with an ARM processor’s barrel shifter.
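The register packing described here can be modeled with shifts and masks. The sketch below keeps two signed 32-bit coefficients in one 64-bit word, extracts the lower one directly and the upper one with a 32-bit right shift, and concatenates the results again. It only models the data layout: on the actual device the shifts would be folded into other instructions by the barrel shifter, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def to_int32(x):
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= MASK32
    return x - (1 << 32) if x >> 31 else x

def pack(lo, hi):
    """Concatenate two signed 32-bit coefficients into one 64-bit word."""
    return (lo & MASK32) | ((hi & MASK32) << 32)

def unpack(reg):
    """Split a 64-bit word into its lower and upper signed coefficients."""
    lo = to_int32(reg)          # lower half, usable directly as a 32-bit register
    hi = to_int32(reg >> 32)    # upper half, moved down by a 32-bit shift
    return lo, hi

def add_sub_packed(reg1, reg2):
    """Process the lower coefficients, then the upper ones, and re-concatenate."""
    a_lo, a_hi = unpack(reg1)
    b_lo, b_hi = unpack(reg2)
    sums = pack(a_lo + b_lo, a_hi + b_hi)
    diffs = pack(a_lo - b_lo, a_hi - b_hi)
    return sums, diffs

reg1 = pack(-5, 7)
reg2 = pack(3, -2)
sums, diffs = add_sub_packed(reg1, reg2)
```

Keeping both coefficients in one register means a single 64-bit load or store moves two coefficients, which is where the memory saving comes from.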

Algorithm 4: ARM-based butterfly method.
(1) Input: in1 (x8), in2 (x9)             ⊳ 4 coefficients on ARM (in1: 2, in2: 2)
(2) Output: out1, out2             ⊳ 4 coefficients on ARM (out1: 2, out2: 2)
(3) // Butterfly 1
(4)             ⊳ in2 (1 coefficient) × zetas
(5)             ⊳
(6)             ⊳
(7)             ⊳
(8)             ⊳ Addition: in1
(9)             ⊳ Subtraction: in1
(10) // Masking
(11)             ⊳ in1 (1 coefficient)
(12)             ⊳ in2 (1 coefficient)
(13) // Butterfly 2
(14)             ⊳ in2 (1 coefficient) × zetas
(15)             ⊳
(16)             ⊳
(17)             ⊳
(18)             ⊳ Addition: in1
(19)             ⊳ Subtraction: in1
(20) // Concatenation
(21)             ⊳
(22)             ⊳

Algorithm 5: NEON-based butterfly method.
(1) Input: in1, in2             ⊳ 8 coefficients on NEON (in1: 4, in2: 4)
(2) Output: out1, out2             ⊳ 8 coefficients on NEON (out1: 4, out2: 4)
(3) // Zetas (twiddle factor) multiplication
(4)             ⊳ in2 (2 coefficients) × zetas
(5)             ⊳ in2 (2 coefficients) × zetas
(6) // Masking
(7)             ⊳ Narrow Extract (Lower)
(8)             ⊳ Narrow Extract (Upper)
(9) // Montgomery reduction
(10)             ⊳ in2 (4 coefficients)
(11)             ⊳ (2 coefficients)
(12)             ⊳
(13)             ⊳ (2 coefficients)
(14)             ⊳
(15) // Masking, Addition, and Subtraction
(16)             ⊳ Narrow Extract (Lower)
(17)             ⊳ Narrow Extract (Upper)
(18)             ⊳ Addition of Butterfly
(19)             ⊳ Subtraction of Butterfly

4.1.3. Butterfly Method on the NEON Engine

The ARMv8-A series supports a powerful NEON engine, which is a SIMD instruction architecture. The NEON engine provides vectorization of 16 8-bit, 8 16-bit, 4 32-bit, or 2 64-bit elements within a 128-bit vector register. Since the modulus of Crystals-Dilithium is 8380417, each coefficient of the polynomial is an element modulo 8380417 and fits in 32 bits. Considering the modulus of Crystals-Dilithium and the parallel unit of the NEON engine, we present task parallelism for the NTT multiplication of Crystals-Dilithium leveraging the NEON engine. Within a 128-bit vector register, it processes four coefficients at the same time. The task parallelism for the butterfly method, which is the basic operation of NTT multiplication, is demonstrated in Algorithm 5. In steps 4–5, a signed multiplication is performed between one input and zetas. The SMULL2 (or SMULL) instruction performs signed multiplication between the upper (or lower) two 32-bit elements of two 128-bit vector registers and stores the two 64-bit results in a vector register. Following that, Montgomery reduction is used to return the result to an element of the ring. Because Montgomery reduction only requires the lower 32 bits of each multiplication result, steps 7–8 collect the lower 32 bits of the multiplication results into one vector register using the XTN and XTN2 instructions. These instructions perform a narrow extraction within the vector register and are divided into XTN and XTN2 according to whether the lower or the upper half of the destination is written.

In steps 10–14, Montgomery reduction is performed in parallel for four coefficients. Step 10 performs the multiplication by QINV, which is one of the steps of Montgomery reduction. Thanks to the preceding masking process, four of these multiplications are processed at the same time. Furthermore, the multiply-subtract instructions resemble the multiply instructions used earlier, with the multiplication and the subtraction performed in the same clock cycle. This allows us to improve the performance of NEON-based Montgomery reduction. The result of the Montgomery reduction is collected into a single 128-bit vector register in steps 16–17 using the proposed masking process. Finally, the NEON-based butterfly method with task parallelism is completed by performing the remaining butterfly operations, addition and subtraction.
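The data flow of the four-lane butterfly can be emulated in portable C, which makes the order of the steps easier to follow. The sketch below is an illustration under stated assumptions, not the NEON assembly itself: each array of four elements stands in for one vector register, each loop stands in for one vector instruction (or an upper/lower pair), and the name `butterfly_x4` and the step mapping in the comments are our reading of Algorithm 5.

```c
#include <stdint.h>

#define Q     8380417   /* Crystals-Dilithium modulus */
#define QINV  58728449  /* Q^(-1) mod 2^32 */
#define LANES 4         /* four 32-bit coefficients per 128-bit register */

/* In-place butterfly on four coefficient pairs (a[i], b[i]). */
static void butterfly_x4(int32_t a[LANES], int32_t b[LANES],
                         const int32_t zeta[LANES])
{
    int64_t prod[LANES];
    int32_t low[LANES], m[LANES], t[LANES];
    int i;

    /* Steps 4-5: widening signed multiplication of the second input by zetas. */
    for (i = 0; i < LANES; i++) prod[i] = (int64_t)b[i] * zeta[i];
    /* Steps 7-8: narrow each product to its low 32 bits (masking). */
    for (i = 0; i < LANES; i++) low[i] = (int32_t)(uint32_t)prod[i];
    /* Step 10: four 32-bit multiplications by QINV at once. */
    for (i = 0; i < LANES; i++) m[i] = (int32_t)((uint32_t)low[i] * (uint32_t)QINV);
    /* Steps 11-14: multiply by Q and subtract from the 64-bit product. */
    for (i = 0; i < LANES; i++) prod[i] -= (int64_t)m[i] * Q;
    /* Steps 16-17: keep the upper 32 bits, gathered into one register
       (assumes an arithmetic right shift for negative values). */
    for (i = 0; i < LANES; i++) t[i] = (int32_t)(prod[i] >> 32);
    /* Steps 18-19: addition and subtraction of the butterfly. */
    for (i = 0; i < LANES; i++) {
        b[i] = a[i] - t[i];
        a[i] = a[i] + t[i];
    }
}
```

Each lane is independent of the others, which is what allows the four reductions to share the same instructions.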

4.1.4. Interleaving Butterfly Method Utilizing ARM/NEON

The ARMv8-A series has two cores: an ARM processor and a NEON engine. The two cores are independent modules that compute independently of one another. The ARM processor is not as powerful as the NEON engine, but it is adequate for some minor tasks. We therefore present a butterfly method that is codesigned for the ARM processor and the NEON engine. This codesign interleaves, rather than serializes, the ARM and NEON implementations of our butterfly method. Figure 3 depicts the processing of the codesigned butterfly method. Because the ARM processor and the NEON engine are used concurrently, the latencies of some coefficient operations in the ARM processor are effectively hidden by NEON overheads, which improves performance beyond what a single core achieves alone.

4.2. Point-Wise Multiplication on ARMv8

Point-wise multiplication is a modular multiplication process that consists of a simple multiplication followed by Montgomery reduction, similar to the zetas multiplication followed by Montgomery reduction in the butterfly operation. Thus, it can be implemented by multiplying by a coefficient of the second polynomial instead of by zetas and then performing Montgomery reduction. Unlike the butterfly operation, point-wise multiplication does not use the memory optimization; it uses only the parallel and interleaving methods. Figure 4 depicts the optimization method proposed for the point-wise multiplication process. Like the interleaving butterfly method, the point-wise multiplication process employs both the ARM core and the NEON engine concurrently. The NEON engine processes four coefficients in parallel, whereas the ARM core processes two coefficients, and the two instruction streams are mixed. The interleaved implementation improves performance by hiding ARM computation latency within NEON overheads. Furthermore, we can minimize pipeline stalls in each implementation while using both cores, together with the parallel implementation and the barrel shifter.
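The four-plus-two split described above can be sketched in portable C. This is only an illustration of the schedule under several assumptions: the array `v` stands in for four NEON lanes, `s0` and `s1` stand in for the two coefficients handled by the ARM core, the degree `N = 256` is the Dilithium polynomial length, and the function name and the handling of the leftover coefficients are hypothetical. Statement order in C does not force the hardware to overlap the two streams; the real implementation interleaves the instructions in assembly.

```c
#include <stdint.h>

#define N    256       /* number of coefficients per polynomial */
#define Q    8380417   /* Crystals-Dilithium modulus */
#define QINV 58728449  /* Q^(-1) mod 2^32 */

/* For |a| < 2^31 * Q, returns r with r * 2^32 == a (mod Q) and |r| < Q. */
static int32_t montgomery_reduce(int64_t a)
{
    int32_t t = (int32_t)((uint32_t)a * (uint32_t)QINV);
    return (int32_t)((a - (int64_t)t * Q) >> 32);
}

/* c[i] = a[i] * b[i] * 2^(-32) mod Q, six coefficients per iteration. */
static void pointwise_interleaved(int32_t c[N], const int32_t a[N],
                                  const int32_t b[N])
{
    int i = 0, k;

    for (; i + 6 <= N; i += 6) {
        int64_t v[4];    /* stand-in for the four vector lanes */
        int64_t s0, s1;  /* stand-in for the two scalar coefficients */

        /* Multiplication stage: vector work and scalar work alternate. */
        for (k = 0; k < 4; k++) v[k] = (int64_t)a[i + k] * b[i + k];
        s0 = (int64_t)a[i + 4] * b[i + 4];
        s1 = (int64_t)a[i + 5] * b[i + 5];

        /* Reduction stage, again alternating between the two streams. */
        for (k = 0; k < 4; k++) c[i + k] = montgomery_reduce(v[k]);
        c[i + 4] = montgomery_reduce(s0);
        c[i + 5] = montgomery_reduce(s1);
    }
    /* 256 is not a multiple of 6, so the remaining coefficients are
       handled one at a time in this sketch. */
    for (; i < N; i++) c[i] = montgomery_reduce((int64_t)a[i] * b[i]);
}
```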

5. Evaluation

5.1. Jetson Xavier

The Jetson Xavier CPU has eight ARMv8.2 cores with the same out-of-order pipeline as ARMv8. Compared to ARMv8, ARMv8.2 adds half-precision floating-point processing, RAS, statistical profiling, and an improved memory model architecture [28]. It has a 64 KB L1 data cache, a 128 KB L1 instruction cache, and a 2 MB L2 cache, and it can run at up to 2.26 GHz. The software is compiled with GCC with the -O3 option; as a result, the compiler makes the benchmarked reference code partially use the NEON engine in each function. For benchmarking, the reference code and our code are each executed 10,000 times, and the clock cycles are measured from the registers in ARMv8.2. The reference implementation is the Crystals-Dilithium submission [13].
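A measurement loop of the kind described here can be sketched as follows. The sketch is hypothetical and is not the benchmarking code used in this study: the counter is passed in as a function pointer because the text does not state which register is read (on AArch64 a common choice is the cycle counter `PMCCNTR_EL0`, which usually has to be enabled for user-space access), and `fake_counter` and `do_nothing` are deterministic stand-ins that only serve to exercise the harness on any machine.

```c
#include <stddef.h>
#include <stdint.h>

#define RUNS 10000  /* number of repetitions, as in the evaluation */

typedef uint64_t (*counter_fn)(void);  /* reads a monotonically increasing counter */
typedef void (*work_fn)(void *ctx);    /* the routine being measured */

/* Runs work(ctx) RUNS times and returns the average counter difference. */
static uint64_t average_cycles(work_fn work, void *ctx, counter_fn now)
{
    uint64_t total = 0;
    int i;

    for (i = 0; i < RUNS; i++) {
        uint64_t start = now();
        work(ctx);
        total += now() - start;
    }
    return total / RUNS;
}

/* Stand-in counter: advances by 7 ticks on every read. */
static uint64_t fake_ticks;
static uint64_t fake_counter(void) { fake_ticks += 7; return fake_ticks; }

/* Stand-in workload that does nothing. */
static void do_nothing(void *ctx) { (void)ctx; }
```

With the stand-ins, `average_cycles(do_nothing, NULL, fake_counter)` measures exactly one counter step per run; on real hardware `fake_counter` would be replaced by a function that reads the chosen register.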

5.2. Results for NTT/InvNTT and Point-Wise Multiplication

Table 3 compares the performance of NTT/InvNTT and point-wise multiplication on the ARMv8.2-based Jetson Xavier between the reference implementation and the presented implementation. In the reference implementation, the additions and subtractions of the NTT/InvNTT conversion process were partially executed in parallel through the NEON engine, whereas the multiplication by the twiddle factor was not. In our work, we use the merging, register-holding, and interleaving methods to reduce memory accesses to the input polynomial and to shorten the execution cycles; as a result, we achieve performance improvements of about 251% and 304% in NTT and InvNTT, respectively. In the reference code of point-wise multiplication, all processes were carried out fully in parallel via the NEON engine. We still achieve a 20% performance improvement on point-wise multiplication through interleaving and compact use of the NEON vector registers. Finally, we achieved a 260% performance improvement over the reference implementation in full NTT-based multiplication.

5.3. Results for Full Schemes

Using our NTT-based multiplication optimization method, we achieved performance improvements of approximately 43.83%, 113.25%, and 41.92% in key pair generation, signing, and verification, respectively, for Crystals-Dilithium security level 3. Furthermore, our implementation outperforms the reference implementation at all security levels. To the best of our knowledge, this is the first optimized implementation of Crystals-Dilithium in an ARMv8 environment. Additionally, we compare our results with another finalist algorithm, Falcon. In the official reference implementations, Crystals-Dilithium always outperforms Falcon in the key generation and signing processes. Our implementation further enhances these performance advantages of Crystals-Dilithium and narrows the performance gap with Falcon in the Verify process.

6. Conclusion

We present three implementation strategies for high-speed NTT implementation in an ARMv8 environment (merging, register-holding, and interleaving) and demonstrate them in Crystals-Dilithium. As a result, we achieve extremely fast implementations on ARMv8 platforms, making Crystals-Dilithium a very efficient candidate in the ARMv8 environment. The parallel load proposal, the use of barrel shifters, and the interleaving technique, in particular, are very well suited to ARM-based platforms. Compared to the reference implementation of Crystals-Dilithium security level 3 in the ARMv8 environment, we achieved performance improvements of 43.83%, 113.25%, and 41.92% in key pair generation, signing, and verification, respectively.

More broadly, we believe that the approach of merging multiple NTT layers, register-holding for the remaining layers, and finally interleaving can be applied to the rings used by other PQC schemes. In particular, it can be used when other PQC algorithms that select a special ring to enable high-speed NTT implementation, such as Crystals-Kyber and Falcon, are implemented in an ARMv8 environment. From the standpoint of implementation design, it is intriguing that the NEON engine and the ARM processor collaborate with each other via the new parallel load and reduced memory access, because this can lower computation costs and define a compact routine that resynchronizes the algorithm-specific NTT layers.

Data Availability

The “source code data” and “optimization method data” used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was partly supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (No. A2021-0270, 6G autonomous security internalization-based technology research to ensure security quality at all times, 100%) and Korea Evaluation Institute of Industrial Technology (KEIT) grant funded by the Korea government (MOTIE).