#### Abstract

This paper proposes 2 × unrolled high-speed architectures of the MISTY1 block cipher for wireless applications, including sensor networks and image encryption. Design space exploration is carried out for 8-round MISTY1 using dual-edge-triggered (DET) and single-edge-triggered (SET) pipelines to analyze the speed/area tradeoff. The design is primarily based on an optimized implementation of lookup tables (LUTs) for MISTY1 and its core transformation functions. The LUTs are designed by logically formulating the S9/S7 s-boxes and the FI and {FO + 32-bit XOR} functions, with fine placement of pipeline registers. Highly efficient, high-speed MISTY1 architectures are thus obtained and implemented on a Virtex-7 field-programmable gate array (FPGA), XC7VX690T. The high-speed/very high-speed MISTY1 architectures achieve throughput values of 25.2/43 Gbps with an area of 1331/1509 CLB slices, respectively. The proposed MISTY1 architectures outperform all previous MISTY1 implementations, combining high speed with low area and thereby achieving high efficiency. They also achieve higher efficiency values than the existing AES and Camellia architectures, which reflects the optimizations made for the proposed high-speed MISTY1 architectures.

#### 1. Introduction

With the advances in high-speed wireless applications, the secure transfer of data has become a major concern [1, 2]. Efforts are underway to provide real-time encryption solutions for high-rate data transmission with minimum power overhead [3–5]. This study primarily focuses on high-speed implementations of the 64 bit MISTY1 block cipher for a wide range of applications, e.g., wireless networks, Ethernet devices, image encryption, and radio network controllers (RNCs) [6].

MISTY1 is a 64 bit block cipher and an ISO-standardized algorithm designed by Mitsubishi Electric Corporation. It is used to handle a 64 bit block of data or less, e.g., 8 byte personal identification numbers (PINs), and offers provable security against linear/differential cryptanalysis with a probability bound of 2^{−56} [7–10]. The differential/integral attacks on MISTY1 require large data and computational complexities, making it practically infeasible to break the cipher. The hardware architecture of MISTY1 and its major subfunctions FO and FI constitute a repetitive loop structure [11]. Therefore, the MISTY1 algorithm is suitable for both resource-constrained and high-speed implementations.

To meet the requirements of the Internet of Things, cryptographic algorithms are frequently optimized for area reduction, for high throughput, or for a good tradeoff between throughput and area [12–25]. For low-area designs, reutilization/logic optimization methodologies have been widely adopted, with s-boxes implemented in combinational logic [12–20]. A single-round MISTY1 architecture designed for compact implementation is proposed in [20], consisting of only the odd-round functions, i.e., 2 × FL functions, 1 × FO function, and 1 × 32 bit XOR. Later, more compact MISTY1 architectures were proposed comprising only one S9/S7 s-box in the FI function [12]; these occupy an area of 3041 and 2331 NAND gates, respectively [12]. Finally, 2 × area-efficient MISTY1 design schemes are proposed in [17] based on a combined substitution unit and threshold throughput requirements. These architectures occupy a very low area of 1853/1546 NAND gates and are the most compact implementations to date. However, our analysis of the throughput values reported in the aforementioned studies shows that the compact MISTY1 architectures attain low throughput, i.e., ≤500 Mbps, and are therefore unsuitable for high-speed applications [12–14, 17, 20].

Contrary to low-area cryptographic hardware architectures, high-speed encryption designs utilize LUTs/RAMs or optimized combinational logic for s-boxes together with pipelining [20–25]. Recently, the focus of studies has also shifted to efficient implementations, measured as the throughput-to-area ratio. To meet high-speed and efficiency requirements, the architecture presented in [20] utilizes FPGA RAM blocks for the implementation of the S7/S9 s-boxes. However, the straightforward implementation of LUTs for the S9/S7 s-boxes (as given in the MISTY1 specification) and a longer path delay, where 4 × XOR operations followed by a RAM access are executed in a single clock cycle, result in a large circuit area and reduced throughput. The architecture presented in [21] utilizes the double-edge-trigger methodology for a high-speed MISTY1 pipeline implementation but has a longer path delay, and no architectural modifications/structural optimizations are made for high-speed MISTY1 implementation. Although the MISTY1 architecture proposed in [22] achieves high speed, it costs a large area owing to its large number of pipeline stages. This study therefore targets a MISTY1 implementation that is both high-speed and efficient. In the last couple of years, multiple studies have been published on other block ciphers. In [26], researchers proposed a block cipher based on a chaotic generator and implemented it on a Xilinx FPGA to prove its effectiveness. Similarly, in [27], Muthalagu and Jain took an existing block cipher algorithm and enhanced its performance to reduce the encryption time.

The unique contributions of the proposed MISTY1 *n* = 8-round pipelined architectures are as follows:

- Optimized implementation of the MISTY1 S9/S7 s-boxes and transformation functions, i.e., FL, FI, FO, and 32-bit XOR, by logic formulation of 4, 5, and 6 bit input LUTs for area reduction
- Design of MISTY1 and its transformation functions so that the processing is distributed in parallel, in order to obtain a highly efficient pipelined architecture
- High-speed exploration of 8-round MISTY1 architectures by employing SET and DET techniques

This paper is organized into five sections with the introduction, i.e., Section 1, followed by optimizations/designing of LUTs for the implementation of MISTY1 transformation functions described in Section 2. Section 3 proposes 2 × high-speed MISTY1 architectures based on SET and DET pipeline schemes. FPGA implementation results/analysis are described in Section 4. Lastly, a brief conclusion is given in Section 5.

#### 2. Optimized Implementation of MISTY1 Transformation Functions

##### 2.1. FI Function

The optimizations made in the design/implementation of the proposed FI function and its constituent S9 and S7 substitution functions are elaborated in Figures 1(a)–1(e). Figures 1(a) and 1(b) depict the FI function and the equivalent FI with modified S9/S7 paths, respectively. The modifications in Figure 1(b) allow simultaneous execution of the leftmost 9 bits and the rightmost 7 bits, where the subscripts ‘L’ and ‘R’ denote the leftmost and rightmost bits, respectively. *T* stands for the TRUNCATE function, and the plus sign denotes a bitwise XOR gate rather than an arithmetic adder. The XOR gate with KI_{R} is placed on the LSB side to reduce the path delay. The LSB bits depend on the MSB bits, and placing the KI_{R} XOR there eliminates the dependency of the MSB bits on the LSB bits. We have optimized the LUTs of the LSB bits by combining S7 with the XOR gate. The hardware cost is reduced by optimizing the LUTs on both the MSB and LSB sides. In the next step, shown in Figure 1(c), the dotted regions of Figure 1(b) are replaced by LUTs {(S9-1 ∼ S9-3), (S9-5 ∼ S9-7), and (S7-1 ∼ S7-3)} combined by XOR gates. The upper-left LUTs (S9-1 ∼ S9-3) are described in Table 1 as per the modified logic expressions (i.e., S9 is used in conjunction with the zero-extended XOR operation), whereas the lower-left LUTs (S9-5 ∼ S9-7) can be obtained by eliminating the (*x*_{10}, *x*_{11}, …, *x*_{16}) bits from the given expressions.

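[Figure 1, panels (a)–(e): (a) FI function; (b) equivalent FI with modified S9/S7 paths; (c) LUT-based FI; (d), (e) FI after reordering of XOR gates.]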

The LUTs for (S7-1 ∼ S7-3) are implemented as 4 bit and 5 bit input LUTs as described in [21]. In the steps shown in Figures 1(d) and 1(e), the XOR gates of Figure 1(c) are reordered to form the S9-4, S9-8, S7-4, and S7-5 LUTs. The primary advantages of the proposed FI function are a reduced number of LUTs and execution in at most 4 clock cycles. Table 2 summarizes the area reduction of 66.7% and 41.3% achieved by the proposed FI function compared to [20, 22], respectively.
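For orientation, the reference FI data flow that these optimizations start from can be sketched in software. The sketch below follows the FI description in the public MISTY1 specification (RFC 2994); it is not the LUT partition proposed here. The tables `S7` and `S9` are placeholder permutations chosen only to have the right size, not the real MISTY1 s-boxes, and the name `fi` is illustrative.

```python
# Placeholder s-boxes: NOT the real MISTY1 tables, only bijections of the
# right size (5 is coprime to 128, 7 is coprime to 512).
S7 = [(x * 5 + 3) % 128 for x in range(128)]   # hypothetical 7-bit permutation
S9 = [(x * 7 + 1) % 512 for x in range(512)]   # hypothetical 9-bit permutation


def fi(x: int, ki: int) -> int:
    """Reference FI data flow: 16-bit input, 16-bit subkey, 16-bit output."""
    d9 = (x >> 7) & 0x1FF        # leftmost 9 bits
    d7 = x & 0x7F                # rightmost 7 bits
    d9 = S9[d9] ^ d7             # S9, then XOR with zero-extended 7-bit half
    d7 = S7[d7] ^ (d9 & 0x7F)    # S7, then XOR with truncated 9-bit half (T)
    d7 ^= (ki >> 9) & 0x7F       # key mixing on the 7-bit half
    d9 ^= ki & 0x1FF             # key mixing on the 9-bit half
    d9 = S9[d9] ^ d7             # second S9, XOR with zero-extended 7-bit half
    return (d7 << 9) | d9


if __name__ == "__main__":
    print(hex(fi(0x1234, 0xBEEF)))
```

Because every step is either a permutation of one half or an XOR of one half into the other, FI is a bijection on 16 bits for any subkey as long as the two tables are permutations. That property is what allows the s-box and XOR stages to be regrouped into merged LUTs without changing the function.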

##### 2.2. FO Function and 32-Bit XOR

The MISTY1 FO transformation function is followed by a 32 bit XOR operation in odd and even rounds (except for the last round), as depicted in Figure 2(a). Therefore, the proposed LUT-based architecture of the FO function comprises {FO + 32 bit XOR}. Figure 2(b) depicts a modified FO function with parallel operations on the left/right 16 bits. The dotted lines in Figure 2(b) divide the FO function into 4 sections, each containing side-by-side logic operations. The proposed FO function is shown in Figure 2(c), comprising 4 LUT blocks each for the left and right 16 bits.

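[Figure 2, panels (a)–(c): (a) FO function with 32 bit XOR; (b) modified FO with parallel left/right 16 bit operations; (c) proposed LUT-based FO.]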

The LUTs of the first and third sections include the XOR operations, whereas the second and fourth sections comprise FI functions and XOR operations. The left-hand side of the second section, denoted FI_{1}, is composed of (FI + XOR), whereas the right-hand side of the second section includes only the FI function. Similarly, the left-hand side of the fourth section, denoted FI_{3}, comprises (FI + (2 × XORs)), whereas the right-hand side contains only an XOR operation. Thus, the FI function described in Section 2.1 is modified as per the design requirements of FI_{1} and FI_{3}, as shown in Figures 3 and 4, respectively.

It is evident from Figures 3(a)–3(c) and 4(a)–4(c) that incorporating the XORs into the FI function mainly requires alterations in the last part of that function. Therefore, new LUTs, shown as S7-6 and S7-7, are added in the lower right part for FI_{1} and FI_{3}, respectively. In addition, S9-8 of Figure 1(e) is replaced by the newly formed LUTs S9-9 and S9-10 in the lower left section of the FI_{1} and FI_{3} functions, respectively.

A uniformly distributed LUT-based FO function and the inclusion of the 32 bit XOR reduce the (initial) latency as well as the pipeline requirements of the proposed MISTY1 architectures. Although the reduction in pipelines and latency is not evident from the figures, the proposed implementation significantly reduces the area. Table 3 summarizes the area of (FO + 32 bit XOR), showing 53.3% and 44.4% reduction compared to [20, 22], respectively. The timing of the proposed FO function is determined by the clock cycles required to execute the FI_{1}/FI_{2}/FI_{3} functions and will be explained in detail in Section 3.
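The {FO + 32 bit XOR} grouping can be illustrated with a small software model of the reference FO data flow, again following the public MISTY1 specification (RFC 2994) rather than the LUT sections of Figure 2. The subkeys are passed as generic tuples `ko` (four 16 bit values) and `ki` (three 16 bit values). The function `toy_fi` is a hypothetical stand-in for FI, used only so that the sketch runs on its own.

```python
MASK16 = 0xFFFF


def toy_fi(x: int, ki: int) -> int:
    """Hypothetical stand-in for FI (NOT the MISTY1 FI): a keyed 16-bit map."""
    return ((x * 0x9E37 + 0x79B9) ^ ki) & MASK16


def fo(x32: int, ko, ki, fi=toy_fi) -> int:
    """Reference FO data flow on a 32-bit input with subkeys ko[0..3], ki[0..2]."""
    t0 = (x32 >> 16) & MASK16            # left 16 bits
    t1 = x32 & MASK16                    # right 16 bits
    t0 = fi(t0 ^ ko[0], ki[0]) ^ t1      # XOR, FI, then XOR with the other half
    t1 = fi(t1 ^ ko[1], ki[1]) ^ t0
    t0 = fi(t0 ^ ko[2], ki[2]) ^ t1
    t1 ^= ko[3]                          # final key XOR
    return (t1 << 16) | t0


def fo_xor(x32: int, other32: int, ko, ki, fi=toy_fi) -> int:
    """{FO + 32-bit XOR}: the Feistel step that updates the other 32-bit half."""
    return other32 ^ fo(x32, ko, ki, fi)


if __name__ == "__main__":
    ko, ki = (1, 2, 3, 4), (5, 6, 7)
    print(hex(fo_xor(0x01234567, 0x89ABCDEF, ko, ki)))
```

Applying `fo_xor` twice with the same input and subkeys restores the original half. This is the usual Feistel property, and it holds regardless of what FI computes.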

##### 2.3. Proposed FL Function and Area Estimation of MISTY1 Architectures

A reference FL function is shown in Figure 5(a), followed by Figure 5(b) showing FL-1 and FL-2, which represent 4/3 bit input LUTs for the left and right 16 bits, respectively. Thus, the area of the *n* = 8-round MISTY1 architecture can be computed by summing the LUTs required for 10 × FL functions, 8 × (FO + 32 bit XOR) functions, and the extended key generation function, i.e., 8 × FI_{2} functions. Table 4 summarizes the area of the proposed MISTY1 architectures.
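One plausible reading of the 4/3 bit input LUTs is sketched below, assuming the reference FL definition from the public MISTY1 specification (RFC 2994): the right half is XORed with (left AND KL1), and the left half is then XORed with (new right OR KL2). Each right output bit then depends on 3 input bits and each left output bit on 4. The names `LUT_R`, `LUT_L`, and `fl_lut` are illustrative and do not come from Figure 5.

```python
def fl(x32: int, kl1: int, kl2: int) -> int:
    """Word-level reference FL with 16-bit subkeys kl1 and kl2."""
    left, right = (x32 >> 16) & 0xFFFF, x32 & 0xFFFF
    right ^= left & kl1
    left ^= right | kl2
    return (left << 16) | right


def _bit(value: int, n: int) -> int:
    return (value >> n) & 1


# 3-input truth table for a right output bit; index = (l << 2) | (r << 1) | k1.
LUT_R = [_bit(i, 1) ^ (_bit(i, 2) & _bit(i, 0)) for i in range(8)]

# 4-input truth table for a left output bit;
# index = (l << 3) | (r << 2) | (k1 << 1) | k2.
LUT_L = [
    _bit(i, 3) ^ ((_bit(i, 2) ^ (_bit(i, 3) & _bit(i, 1))) | _bit(i, 0))
    for i in range(16)
]


def fl_lut(x32: int, kl1: int, kl2: int) -> int:
    """Bit-sliced FL: one 3-input and one 4-input table lookup per bit position."""
    out = 0
    for n in range(16):
        l, r = _bit(x32, n + 16), _bit(x32, n)
        k1, k2 = _bit(kl1, n), _bit(kl2, n)
        out |= LUT_R[(l << 2) | (r << 1) | k1] << n
        out |= LUT_L[(l << 3) | (r << 2) | (k1 << 1) | k2] << (n + 16)
    return out


if __name__ == "__main__":
    print(hex(fl(0x01234567, 0xA5A5, 0x0F0F)), hex(fl_lut(0x01234567, 0xA5A5, 0x0F0F)))
```

Both functions give the same result for every input, so the whole FL function maps onto 16 lookups of each table with a logic depth of one LUT.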

#### 3. Design Space Exploration for High-Speed MISTY1 Architectures

##### 3.1. Architecture 1: DET Pipeline Architecture for High-Speed MISTY1

A high-speed MISTY1 pipelined architecture is shown in Figure 6, whereas the respective FO and FI functions (only the FI_{2} function is shown for reference) are depicted in Figures 7(a) and 7(b). High-speed MISTY1 comprises an 8-round architecture with 5-stage and 10-stage pipelines in odd and even rounds, respectively. The number of pipeline stages in odd and even rounds of MISTY1 is based on the number of clock cycles required to execute the FO/FI functions. A double-edge-triggered pipeline is employed with each LUT triggering on alternate clock cycles. This reduces the pipeline requirements of the MISTY1 architecture; however, it has a path delay of 2 × LUTs as mentioned in [11]. The proposed MISTY1 architecture can process 41 × plaintexts and outputs the required ciphertext of 64 bits per clock cycle. Thus, high-speed MISTY1 is obtained with DET pipelines and highly optimized FO/FI function implementations.

[Figure 7, panels (a), (b): (a) FO function and (b) FI_{2} function of the DET pipeline architecture.]

##### 3.2. Architecture 2: MISTY1 SET Pipeline Architecture for Very High-Speed MISTY1

Very high-speed MISTY1 and its respective FO and FI functions (FI_{1} and FI_{3} functions are presented here for reference) employing single-edge-triggered pipelines are depicted in Figures 8 and 9.

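[Figures 8 and 9, panels (a)–(c): very high-speed MISTY1 architecture and its FO, FI_{1}, and FI_{3} functions with SET pipelines.]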

It is evident that the FI_{1} function requires 4 clock cycles, whereas the corresponding FO function is executed in 9 clock cycles. Pipeline registers are inserted in the FO function as well as in the MISTY1 architecture to synchronize the LSB and MSB bits. The path delay of the SET-based pipelined architecture is 1 × LUT, and therefore, the architecture achieves very high speed. Increasing the number of pipeline stages increases the latency, i.e., the time to generate the first ciphertext, which is 77 clock cycles. The proposed architecture is highly suitable for high-speed applications of the order of 40 Gbps.
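The relationship between pipeline depth, latency, and sustained output rate can be sketched with a minimal cycle-level model. It assumes an idealized pipeline that accepts one block per clock cycle and never stalls. The function `run_pipeline` and its arguments are hypothetical, and the stages only pass data along rather than encrypting it.

```python
from collections import deque


def run_pipeline(blocks, depth):
    """Clock `blocks` through `depth` register stages, one new block per cycle.

    Returns (cycle, block) pairs, where `cycle` is the 1-based clock cycle at
    which the block reaches the last stage. Blocks must not be None, because
    None marks an empty stage (a bubble).
    """
    regs = deque([None] * depth, maxlen=depth)
    outputs = []
    feed = list(blocks) + [None] * (depth - 1)   # trailing bubbles flush the pipe
    for cycle, item in enumerate(feed, start=1):
        regs.appendleft(item)                    # shift every stage by one
        if regs[-1] is not None:
            outputs.append((cycle, regs[-1]))
    return outputs


if __name__ == "__main__":
    # Depth 77 mirrors the latency quoted above for the SET architecture.
    for cycle, block in run_pipeline(range(5), depth=77):
        print(cycle, block)
```

In this model, N blocks take depth + N − 1 cycles in total. A deeper pipeline therefore only delays the first ciphertext, while the steady-state rate stays at one block per cycle.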

#### 4. Hardware Implementation Results and Comparison

The proposed MISTY1 high-speed architectures are implemented on FPGA Xilinx Virtex-7, XC7VX690T. The performance comparison/analysis is carried out with existing high-speed Camellia, AES, and MISTY1 architectures. Table 5 depicts the performance parameters, i.e., throughput, area, and efficiency, of the proposed and existing design schemes.

The proposed MISTY1 architectures outperform all previous MISTY1 implementations, combining high speed with low area and thereby achieving high efficiency. The throughput values obtained are 43/25.2 Gbps with a high efficiency of 28.5/18.9 Mbps/slice for the very high-speed/high-speed MISTY1 architectures, respectively. For a fair comparison, the referred MISTY1 architectures [20, 22] are implemented using the same FPGA device, i.e., Xilinx Virtex-7. The proposed architectures thus represent the most efficient high-speed MISTY1 implementations to date. Besides, the proposed architectures have higher efficiency values compared to the existing AES and Camellia architectures (as per our study). This reflects the optimizations made for the proposed high-speed MISTY1 architectures.
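The efficiency figures follow directly from the reported throughput and area, as the short check below shows. The helper names are illustrative. The implied clock frequency additionally assumes that one 64 bit block is produced per clock cycle, which is stated for the DET architecture in Section 3.1 and is only an assumption here for the SET architecture.

```python
def efficiency_mbps_per_slice(throughput_gbps: float, slices: int) -> float:
    """Throughput-to-area ratio in Mbps per CLB slice."""
    return throughput_gbps * 1000.0 / slices


def implied_clock_mhz(throughput_gbps: float, block_bits: int = 64) -> float:
    """Clock frequency implied by one block_bits-wide output per clock cycle."""
    return throughput_gbps * 1000.0 / block_bits


if __name__ == "__main__":
    designs = [("very high-speed (SET)", 43.0, 1509), ("high-speed (DET)", 25.2, 1331)]
    for name, gbps, slices in designs:
        eff = efficiency_mbps_per_slice(gbps, slices)
        clk = implied_clock_mhz(gbps)
        print(f"{name}: {eff:.1f} Mbps/slice, implied clock {clk:.1f} MHz")
```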

#### 5. Conclusion

In this paper, we proposed 8-round pipelined MISTY1 architectures that provide high-speed and efficient implementations. The structural optimizations and logic modifications in the MISTY1 transformation functions reduce the LUT and pipeline requirements. The proposed high-speed MISTY1 architectures using SET and DET pipelines explore the speed/area tradeoffs for FPGA implementations. The design/optimization schemes can be extended to the high-speed implementation of the KASUMI algorithm. The high-speed designs have applications in wireless sensor networks, image encryption, and network controllers.

##### 5.1. Future Work

This paper deals only with a high-speed MISTY1 block cipher. In the future, we intend to design an energy-efficient MISTY1 block cipher using capacitance scaling, clock gating, clock enable, thermal scaling, voltage scaling, and other energy-efficient techniques, and to check the thermal stability of MISTY1. The implementation in this paper targets a Virtex-7 FPGA based on 28 nm technology. The design can also be reimplemented on the 20 nm Virtex UltraScale FPGA and the 16 nm Virtex UltraScale+ FPGA.

#### Data Availability

The data used to support the findings of this study are included within the article.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.