Abstract
We present a memory-less design of the advanced encryption standard (AES) with an 8-bit data path for wireless communication applications. The design uses the minimum of 160 clock cycles to process a 128-bit data block. To meet the requirements of low area cost and high performance, new design methods are used to optimize the MixColumns (MC), Inverse MixColumns (IMC), ShiftRows (SR), and Inverse ShiftRows (ISR) transformations. Our methods efficiently reduce the required clock cycles, critical path delays, and area costs of these transformations compared with previous designs. In chip realization, our design with both encryption and decryption abilities has a 29% area increase but achieves a 4.85 times improvement in throughput/area compared with the best 8-bit AES design reported before. For encryption only, our AES occupies 3.5 k gates with a critical delay of 12.5 ns and achieves a throughput of 64 Mbps, which outperforms previous encryption-only designs.
1. Introduction
The AES algorithm has been widely used for data transmission in wireless communications [1–3] and RFID applications [4, 5]. An ASIC implementation of AES can meet the requirements of low cost and high performance, and a design with low area cost usually also has low power consumption. The area of an AES design can be reduced by optimizing the architectures of its subfunctions [4, 6–11], sharing the common operations of subfunctions [6, 9, 10, 12], and narrowing the data path of the overall architecture [1, 2, 4–7, 10, 12–15]. Because the AES algorithm is inherently iterative, its data path can be shrunk to 8 bits [1, 2, 4–7, 10, 12, 13] to reduce the area cost. The 8-bit ASIC design reported in [5] has the smallest area cost among these versions but also the lowest performance, since more clock cycles are needed for encryption and decryption. To reduce the area cost while keeping acceptable performance, the proposed AES uses an 8-bit data path and the minimum number of clock cycles to perform the encryption and decryption processes.
For portability across different platforms and CMOS technologies, our AES is designed with pure combinational logic, without any memory blocks. The newly proposed design methods for the major transformations reduce the area cost of the AES while keeping a throughput high enough to meet the requirements of wireless communications. The experimental results show that our AES design has a better performance/area ratio than previous designs. The remainder of this paper is organized as follows. Section 2 briefly describes the AES algorithm and its transformations. The new designs of the transformations and the overall AES architecture are proposed in Section 3. Section 4 describes experimental results and comparisons with previous designs. Finally, conclusions are given in Section 5.
2. AES Algorithm
2.1. AES Algorithm
The AES algorithm with an 8-bit data path takes at least 160 clock cycles to process a 128-bit data block, since each of the ten rounds must pass the sixteen state bytes through the data path one byte per cycle. The encryption process performs the ShiftRows (SR), SubBytes (SB), MixColumns (MC), and AddRoundKey (ARK) transformations. A separate KeyExpansion (KE) unit is required to generate the round key for each ARK. The decryption process has three inverse transformations, InvShiftRows (ISR), InvSubBytes (ISB), and InvMixColumns (IMC), and one ARK. Each normal round of decryption performs these four transformations. The round keys used in the decipher process are the round keys generated in the cipher process, applied in reverse order.
2.2. AES Transformations
Four kinds of transformations and one key generation unit in the AES algorithm are described as follows.
(a) SB/ISB Transformations. The transformations are non-linear substitution operations where each byte of the input state is replaced by its multiplicative inverse (MI) in GF(2^8), followed by an affine transformation (AF) over GF(2). Similarly, the ISB transformation performs the inverse affine transformation (IAF) followed by the MI operation in GF(2^8).
(b) MC/IMC Transformations. The transformations operate column-by-column on the byte array and treat each column as a four-term polynomial with coefficients over GF(2^8). The MC transforms each column into a new one by multiplying it with the constant polynomial a(x) = {03}x^3 + {01}x^2 + {01}x + {02} modulo x^4 + 1. The IMC operation is a multiplication of each column with the inverse polynomial a^-1(x) = {0b}x^3 + {0d}x^2 + {09}x + {0e} modulo x^4 + 1.
(c) SR/ISR Transformations. The SR transformation rotates the last three rows of the state to the left by one, two, or three bytes depending on the row numbers. The ISR rotates them in the inverse direction of the SR.
(d) ARK Transformation. In each round, the ARK transformation performs an addition of the state with the round key using a bitwise XOR operation.
(e) KE Unit. In each round, the KE unit generates a new 128-bit round key for the XOR operation with the state in the ARK transformation.
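The SB/ISB definition in item (a) can be illustrated with a short software sketch. This is a plain reference model that assumes the standard AES field polynomial x^8 + x^4 + x^3 + x + 1 and computes the MI directly in GF(2^8); it is not the hardware design of this paper. The function names (gf_mul, gf_inv, sub_byte, inv_sub_byte) are hypothetical and chosen only for illustration.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # reduce when the degree reaches 8
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse (MI) in GF(2^8); 0 maps to 0 by AES convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):   # a^254 = a^-1 because the nonzero elements form a group of order 255
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sub_byte(a):
    """SB: MI followed by the affine transformation (AF)."""
    x = gf_inv(a)
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def inv_sub_byte(b):
    """ISB: inverse affine transformation (IAF) followed by MI."""
    x = rotl8(b, 1) ^ rotl8(b, 3) ^ rotl8(b, 6) ^ 0x05
    return gf_inv(x)
```

Because SB and ISB both end or begin with the same MI step, a combined unit only needs one copy of gf_inv, which is the sharing opportunity that compact designs exploit.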
3. Design of Our AES Architecture
3.1. Designs of Major Transformations
The optimization of separate transformations focuses on two major transformations, SR and MC, and their inverses, ISR and IMC. The designs of these transformations are described as follows.
(a) The Design of SR/ISR Unit. In this paper, we propose a combined SR/ISR design as shown in Figure 1. It uses twelve 8-bit registers for receiving and storing data from the MC or ARK units. The output sequences are generated after performing the SR rotations. Equation (1) shows the original 4 by 4 state matrix and the output state matrix after the SR rotations. The states in the first row are unchanged by the SR. The states in the second, third, and fourth rows are rotated to the left by one, two, and three positions, respectively.

For completing the rotation sequences, several multiplexers are added in Figure 1. The states are inputted to the SR unit by the sequences of their state number. Therefore, the state is the first one that is inputted to the SR in the first clock cycle, and the state is the last input to the SR in the sixteenth clock cycle. The output sequences of the states in the first row are unchanged after performing the SR rotations. The input state , , and are stored in register , , and , respectively, after several clock cycles. The state is stored in register after outputting the state from register . In the second row, the first state is delayed to the last one and other states , , and bypass the state using the multiplexer. These three states are outputted before the state .
In the third row, the first two states are delayed behind the last two states after the rotation. In the fourth row of the original state matrix, the last state becomes the first output state of that row after performing the SR. Similarly, the ISR performs the rotations in the inverse direction of the SR by using the multiplexers to bypass some states so that the outputs appear in the correct sequence. The design in [13] is the best method for the SR and ISR rotations reported so far, but our design saves four 8-bit registers and shortens the critical path delay of the unit.
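The reordering that the SR/ISR unit has to produce on a byte-serial stream can be sketched as follows. The sketch assumes the state bytes are numbered 0 to 15 column by column, as in the AES standard; the exact numbering and register assignment of Figure 1 are not reproduced here, and the helper names shift_rows_order and apply_order are hypothetical.

```python
def shift_rows_order(inverse=False):
    """For each output slot of the byte stream, return the index of the input byte.

    Assumption: byte i sits in row i % 4 and column i // 4 of the state.
    """
    order = []
    for col in range(4):
        for row in range(4):
            shift = -row if inverse else row   # SR rotates row r left by r; ISR rotates right
            src_col = (col + shift) % 4
            order.append(4 * src_col + row)
    return order


def apply_order(state, order):
    """Reorder a 16-byte state according to the given source indices."""
    return [state[i] for i in order]


sr = shift_rows_order()
isr = shift_rows_order(inverse=True)
# sr  == [0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11]
# isr == [0, 13, 10, 7, 4, 1, 14, 11, 8, 5, 2, 15, 12, 9, 6, 3]
```

Under this numbering, byte 15 arrives last in the input stream but is needed as the fourth output byte, which illustrates why a serial SR unit must buffer most of the state before it can emit the rotated rows.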
(b) The Design of MC/IMC Unit. In our AES design, the MC and IMC units are separated to reduce the complexity of the data paths. Equation (2) shows the four input states of a column, which are multiplied by the constant values {02}, {03}, {01}, and {01}, respectively, in the Galois field GF(2^8) to generate the four output states. The equation also shows that the constant values in the second, third, and fourth rows are cyclically rotated by one, two, and three positions relative to the first row.
As shown in Figure 2, our MC design uses eight 8-bit registers to store the states and uses two multiplication units, one for multiplication by {02} and one for multiplication by {03}, to perform the multiplications in (2). These two multiplication units are realized by simple bit-level XOR operations. The MC design uses two levels of registers. The four registers in the first level receive data from the MC or the ARK units. The second-level registers prepare the calculation operations for the outputs. For example, the first output state of a column is the XOR of {02} times the first input state, {03} times the second input state, and the unmodified third and fourth input states. The four input states of a column are loaded into the first-level registers over four clock cycles. In the next cycle, the four states are transferred to the second-level registers, where the multiplications by {02} and {03} are performed. The MC unit outputs the four resulting states in the subsequent four clock cycles. At the same time, the next four states are loaded into the first-level registers and wait to be multiplied with the constant matrix. The MC unit needs sixteen clock cycles to complete the calculation of a 128-bit state. The designs in [1, 4–7, 14, 15] are the best methods for the MC and IMC operations reported before, but our design saves a further twenty-four 8-bit registers and shortens the critical path delay of the MC unit. Similar optimization results are also obtained in the design of the IMC unit.
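The arithmetic performed by the MC and IMC units on one column can be checked with a small software model. The sketch below follows the standard AES column matrices and only models the arithmetic, not the register timing of Figure 2; the names xtime, mul2, mul3, mix_column, gf_mul_const, and inv_mix_column are hypothetical.

```python
def xtime(a):
    """Multiply a byte by {02} in GF(2^8): a shift plus a conditional XOR with 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def mul2(a):
    """Model of the {02} multiplication unit."""
    return xtime(a)


def mul3(a):
    """Model of the {03} multiplication unit: {03}a = {02}a XOR a."""
    return xtime(a) ^ a


def mix_column(col):
    """MC on one column of four bytes."""
    s0, s1, s2, s3 = col
    return [
        mul2(s0) ^ mul3(s1) ^ s2 ^ s3,
        s0 ^ mul2(s1) ^ mul3(s2) ^ s3,
        s0 ^ s1 ^ mul2(s2) ^ mul3(s3),
        mul3(s0) ^ s1 ^ s2 ^ mul2(s3),
    ]


def gf_mul_const(a, c):
    """Multiply byte a by a small constant c using repeated xtime."""
    result = 0
    while c:
        if c & 1:
            result ^= a
        a = xtime(a)
        c >>= 1
    return result


def inv_mix_column(col):
    """IMC on one column; the first matrix row is {0e} {0b} {0d} {09}."""
    consts = [0x0E, 0x0B, 0x0D, 0x09]
    out = []
    for r in range(4):
        acc = 0
        for c in range(4):
            acc ^= gf_mul_const(col[c], consts[(c - r) % 4])  # row r is row 0 rotated by r
        out.append(acc)
    return out
```

For example, mix_column([0xDB, 0x13, 0x53, 0x45]) gives [0x8E, 0x4D, 0xA1, 0xBC], and inv_mix_column maps that result back to the original column. Both mul2 and mul3 reduce to shifts and XORs, which matches the statement that the multiplication units need only bit-level XOR operations.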

3.2. The Design of Overall Architecture
We realized the iterative AES architecture designs using the TSMC 0.18 µm cell library. Figure 3 shows the 8-bit AES processor architecture. A plaintext block and the encryption key are loaded into the AES through the 8-bit input ports data_in and key_in. The enc signal is used to select the encryption or decryption process. The SB can be realized by the calculation of the Multiplicative Inverse (MI) in GF((2^4)^2) and Affine Transformation (AF) units. The ISB can be realized by the same MI calculation as the SB and inverse affine transformation (IAF) units. To reduce the area cost of the combined implementation of the SB/ISB units, the MI logic is usually shared. The key expansion unit is used to generate and output the required 8-bit round key to the ARK. Since the round keys are used in reverse order in decryption, the inverse cipher process can start only after the last round key has been generated. Afterward, the key expansion with the same round keys can be executed concurrently with the decryption process.

4. Experimental Results
In Table 1, various 8-bit AES designs in different technologies are listed for comparison. The designs in [1, 4, 7, 14] only have the encryption ability. The design in [4] is the encryption-only version of the previous design [5], for application in radio frequency identification (RFID). The design in [5] adopts the clock gating method to reduce the power consumption. One pipeline stage is used to reduce the critical path delay in the SB/ISB design. The SR/ISR units are implemented by random access memory (RAM). In [7], the encryption-only design merges the ARK and SR operations by using four pipeline stages to generate the correct output order of the computed state. The design in [1] provides a low power AES design for the RFID application. It uses gated clock design to reduce unwanted switching activity, the same approach as in [5]. Good and Benaissa [14] proposed a low power/area AES chip design that provides a series of finite-field doubling, tripling, and XOR operations to perform the MC transformation. It also adopts separate data and key memories, the same approach as in [1], for parallel processing the state and round key.
In Table 1, we provide two kinds of implementation information of our AES design including AES with encryption ability and AES with encryption/decryption abilities in the chip level. The chip design is used to compare with other chip results that have their circuits fabricated.
We observe that most realizations are encryption only due to the fast verification of their designs. Most realizations of AES with an 8-bit data path require more clock cycles to compute a 128-bit data block, resulting in a lower throughput rate. Therefore, most realizations are suitable for applications with low frequency and throughput rate requirements, such as RFID. On the other hand, our design with higher throughput can be used in applications such as 802.11-series wireless networks. Our AES design with only encryption ability occupies 3.5 k gates with a critical delay of 12.5 ns. The major improvement of our AES in this version is that our architecture designs minimize the number of clock cycles and the critical path delay required for processing the MC and SR operations.
The area cost and critical path delay of our AES are similar to those of the best design in [5], but our design achieves a throughput of 64 Mbps, which is the highest among the encryption-only designs compared. The area cost of our AES design with both encryption and decryption abilities increases by about 29%, but the throughput improves 4.85 times compared with the best design in [5]. From the experimental results, we also observe that our AES design has the best normalized performance of throughput per gate compared with other previous designs.
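The reported encryption-only throughput is consistent with the cycle count and critical delay given above. The short check below assumes that the clock period equals the 12.5 ns critical delay (an 80 MHz clock) and that one block takes 10 rounds of 16 byte-cycles; the function name throughput_mbps is hypothetical.

```python
def throughput_mbps(block_bits, cycles_per_block, clock_period_ns):
    """Throughput in Mbps for a design that needs cycles_per_block cycles per block.

    Assumption: the clock runs at the critical path delay, with no extra margin.
    """
    # bits per nanosecond times 1000 gives megabits per second
    return block_bits * 1000 / (cycles_per_block * clock_period_ns)


rounds = 10            # rounds per 128-bit block
bytes_per_state = 16   # one byte passes through the 8-bit data path per cycle
cycles = rounds * bytes_per_state          # 160 clock cycles
rate = throughput_mbps(128, cycles, 12.5)  # 64.0 Mbps
```

The same formula shows how sensitive an 8-bit design is to the cycle count: under the same clock assumption, a design needing about 1000 cycles per block would deliver roughly 10 Mbps.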
5. Conclusions
In this paper, we have presented new design methods for the AES transformations and their architecture. The major transformations, SR/ISR and MC/IMC, dominate the clock cycles and path delays required for data encryption and decryption. We presented two design methods that efficiently optimize these transformations, and the proposed architecture improves the throughput while keeping the area cost low compared with previous designs. The design is suitable for area-limited applications that require high throughput, such as wireless communications. The implementation results demonstrate that the proposed design has the highest throughput with low area cost.
Acknowledgments
This work is supported in part by Taiwan’s Ministry of Education under Project no. 101B-09-027. One of the authors would like to thank Professor Shen-Fu Hsiao for providing the comments.