#### Abstract

The discrete cosine transform (DCT) and inverse DCT (IDCT) are widely used in image processing systems and in real-time computation of nonlinear time series. In this paper, a unified DCT/IDCT algorithm based on the subband decomposition of a signal is proposed. It is derived recursively from the data flow of subband decompositions with factorized coefficient matrices. The proposed algorithm only requires and multiplication time for $N$-point DCT and IDCT, with a single multiplier and a single processor, respectively. Moreover, the proposed algorithm achieves a higher peak signal-to-noise ratio (PSNR) than the conventional DCT/IDCT. As a result, the subband-based approach to DCT/IDCT is preferable to the conventional approach in terms of computational complexity and system performance. The proposed reconfigurable linear-array DCT/IDCT processor architecture has been implemented on an FPGA.

#### 1. Introduction

The discrete cosine transform (DCT), first proposed by Ahmed et al. [1], is a Fourier-like transform. While the Fourier transform decomposes a signal into sine and cosine functions, the DCT uses only cosine functions and has the property of high energy compaction. Because the DCT offers a good trade-off between the optimal decorrelation of the Karhunen-Loève transform and computational simplicity [2], it has been extensively used in many applications [3–14]. In particular, the two-dimensional (2D) DCT, such as the $8 \times 8$ DCT, has been adopted in international standards such as JPEG, MPEG, and H.264 [15]. In the MP3 audio codec, the subband analysis and synthesis filter banks require a 32-point DCT/integer DCT to expedite computation [16]. Other audio compression standards, for example, the Dolby Digital AC-3 codec, use a modified DCT with 256 or 512 data points.

Many algorithms have been proposed for DCT/IDCT [17–21]. In these algorithms, the transformation matrix is factorized into products of simpler matrices. However, the factorized matrices are not as regular as those of the fast Fourier transform (FFT); thus, these algorithms achieve only moderate computational speed. Specifically, the dedicated data paths deduced from the signal flow graphs (SFGs) of these algorithms must be optimized for performance, which is computationally intensive, and a custom-designed DCT is often complicated and does not scale easily to a variable number of data points.

In this paper, we propose a novel linear-array architecture based on the subband decomposition of a signal for scalable DCT/IDCT. The remainder of this paper is organized as follows. The subband-based 8-point DCT/IDCT algorithm [22] is reviewed in Section 2. Its extension to $N$-point DCT/IDCT, called the unified subband-based algorithm, is proposed in Section 3. Section 4 presents the analysis of system complexity. The reconfigurable architecture of the linear-array DCT/IDCT processor, implemented on a field programmable gate array (FPGA), is proposed in Section 5, and conclusions are given in Section 6.

#### 2. The Subband-Based 8-Point DCT/IDCT Algorithm

The discrete cosine transform (DCT) of an 8-point signal, , is defined as where , and for . It can be rewritten in the following matrix form: where , , and the transformation matrix is as follows: where , and .
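The matrix form of the 8-point DCT can be illustrated numerically. The following is a minimal Python sketch that assumes the standard orthonormal DCT-II scaling (weight $\sqrt{1/N}$ for the DC row and $\sqrt{2/N}$ for the others); the scaling used in the derivation above may differ by a constant factor, and the names `dct_matrix`, `T8`, and the sample vector are illustrative only.

```python
import numpy as np

def dct_matrix(n: int = 8) -> np.ndarray:
    """Orthonormal DCT-II matrix T with T[k, m] = s(k) * cos((2m + 1) k pi / (2n))."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    m = np.arange(n).reshape(1, -1)   # sample index (columns)
    t = np.cos((2 * m + 1) * k * np.pi / (2 * n))
    scale = np.full((n, 1), np.sqrt(2.0 / n))
    scale[0, 0] = np.sqrt(1.0 / n)    # the DC row carries the smaller weight
    return scale * t

T8 = dct_matrix(8)
x = np.array([52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0])  # arbitrary sample data
C = T8 @ x                            # matrix form of the 8-point DCT
```

With this scaling the rows of `T8` are orthonormal, so the DC coefficient `C[0]` equals the sum of the samples divided by $\sqrt{8}$.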

Let and denote the low-frequency and high-frequency subband signals of , respectively [22], which can be obtained by where . As one can see, the DCT of can be rewritten as where and are the subbands DCT and DST (discrete sine transform) of , respectively. Its vector form is as follows: where and denote the matrices of the subband DCT and DST, respectively, , and . According to (2.4), the 8-point matrix can be written as Due to the orthogonality between and , and can be obtained from by In [22], the multistage subband decomposition is as follows. and denote the DCTs of and , respectively, which can be obtained by and , where is the transformation matrix We have Similarly, the 4-point DCT computations of and can be obtained by for .
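The relation between the DCT of a signal and the DCT/DST of its two subbands can be checked numerically. The sketch below assumes the usual averaging and differencing subbands, $x_L(n) = \tfrac{1}{2}[x(2n) + x(2n+1)]$ and $x_H(n) = \tfrac{1}{2}[x(2n) - x(2n+1)]$, and an unnormalized cosine sum, for which $C(k) = 2\cos\frac{k\pi}{2N}\,C_L(k) + 2\sin\frac{k\pi}{2N}\,S_H(k)$ holds exactly. These conventions and the helper names are assumptions made for illustration; the scaling in the equations above may differ.

```python
import numpy as np

def cos_sum(x, num_bins):
    """sum_n x[n] * cos((2n + 1) k pi / (2 * len(x))) for k = 0 .. num_bins - 1."""
    n = np.arange(len(x))
    k = np.arange(num_bins).reshape(-1, 1)
    return (x * np.cos((2 * n + 1) * k * np.pi / (2 * len(x)))).sum(axis=1)

def sin_sum(x, num_bins):
    """Same as cos_sum, but with a sine kernel (subband DST)."""
    n = np.arange(len(x))
    k = np.arange(num_bins).reshape(-1, 1)
    return (x * np.sin((2 * n + 1) * k * np.pi / (2 * len(x)))).sum(axis=1)

def subband_dct(x):
    """Unnormalized N-point DCT assembled from the half-length subband signals."""
    x = np.asarray(x, dtype=float)
    n_points = len(x)
    x_low = 0.5 * (x[0::2] + x[1::2])    # low-frequency subband (assumed form)
    x_high = 0.5 * (x[0::2] - x[1::2])   # high-frequency subband (assumed form)
    theta = np.arange(n_points) * np.pi / (2 * n_points)
    return (2 * np.cos(theta) * cos_sum(x_low, n_points)
            + 2 * np.sin(theta) * sin_sum(x_high, n_points))

x = np.array([3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, 6.0])  # arbitrary test signal
direct = cos_sum(x, len(x))
via_subbands = subband_dct(x)
```

Here `direct` and `via_subbands` agree to floating-point precision, which is the property the multistage decomposition applies repeatedly to the 4-point and 2-point subbands.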

The 4-point distributed matrix , can be defined as Let , , , and be the 2-point DCTs of , , , and , respectively, which can be computed by using the following 2-point transformation matrix, : We have For the 2-point DCT computations of , , , and , let where , we then have is a 2-point matrix defined as Finally, according to equations (2.6)~(2.33) together with (2.4), we have where The following matrix, , can be derived from (2.12)~(2.15). Similarly, the following matrix, , can be derived from (2.21)~(2.28). The decomposition matrix can be defined as and the coefficient matrix can be defined as According to (2.34), (2.38), and (2.39), we have

The coefficient matrix can be represented by the reordered coefficient matrix , pre-permutation matrix , and post-permutation matrix , and can be written as where the matrices , , and can be defined as The reordered coefficient matrix can be represented as The computation of the sub-coefficient matrix can be written as where and . The above can be rewritten as [20] The inverse reordered coefficient matrix can be represented as As a result, the total number of multiplications of the subband-based 8-point IDCT is only 15.

#### 3. The Unified Subband-Based $N$-Point DCT/IDCT Algorithm

The subband-based DCT algorithm [22] can be unified for $N$-point DCT/IDCT owing to its inherently regular pattern. For an $N$-point signal, , the unified subband-based discrete cosine transform can be defined as where , is the decomposition matrix, is the reordered coefficient matrix, is the pre-permutation matrix, and is the post-permutation matrix. The unified decomposition matrix can be written as where the basic matrix, , consists of two submatrices, and As noted, can be represented by the sub-matrices of , or sub-matrices, and as follows:

According to (3.1) and (3.2), the unified distributed matrix can be derived as According to (2.2) and (2.40), we have The 4-point reordered coefficient matrix can be derived as According to (3.5), the 8-point reordered coefficient matrix can be derived as Hence, the unified reordered coefficient matrix can be written as The 4-point pre-permutation matrix and post-permutation matrix can be written as According to (2.50), (2.51), (3.8), and (3.9), the 8-point pre- and post-permutation matrices can be written as Hence, the unified pre- and post-permutation matrices can be represented as where denotes .

According to (2.47) and (2.53), the unified subband-based IDCT can be obtained as where is the inverse decomposition matrix, is the inverse coefficient matrix, is the inverse pre-permutation matrix, and is the inverse post-permutation matrix. Note that the decomposition matrix, the reordered coefficient matrix, and the pre- and post-permutation matrices are all orthonormal. Hence, we have
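Because every factor is orthonormal, the inverse transform is the product of the transposed factors in reverse order, so no matrix inversion is needed. The Python sketch below illustrates only this property; the four matrices are random stand-ins (two permutation matrices and two orthonormal matrices obtained by QR factorization), not the actual decomposition, coefficient, and permutation matrices, and the order of the factors is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

def random_permutation_matrix(size):
    """Identity matrix with its rows shuffled; always orthonormal."""
    return np.eye(size)[rng.permutation(size)]

def random_orthonormal_matrix(size):
    """Q factor of a random matrix; its columns are orthonormal."""
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    return q

# Stand-ins for the four orthonormal factors (hypothetical ordering).
P_post = random_permutation_matrix(n)
R = random_orthonormal_matrix(n)
P_pre = random_permutation_matrix(n)
D = random_orthonormal_matrix(n)

forward = P_post @ R @ P_pre @ D
inverse = D.T @ P_pre.T @ R.T @ P_post.T   # transposes, applied in reverse order

x = rng.standard_normal(n)
x_rec = inverse @ (forward @ x)
```

In hardware terms this is what lets the IDCT reuse the same building blocks as the DCT with the data flow reversed.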

#### 4. Analysis of Computation Complexity and System Performance

Based on the 8-point subband-based DCT and IDCT algorithm, the data flows of parallel-pipelined processing for the 8-point DCT and IDCT are described as follows. The data flow of the subband-based 8-point DCT with six pipelined stages is shown in Figure 1. In this figure, , and , the matrix-vector multiplication of in the first stage, takes one simple-addition time for each element of . The pre-shuffle performs the pre-permutation matrix operation in the second stage. The matrix-vector multiplication is used to compute in the third and fourth stages. In the fifth stage, the post-shuffle performs the post-permutation matrix operation . The final stage computes by using simple shift operations with the Booth recoding algorithm.

Similarly, Figure 2 shows the data flow of the subband-based 8-point IDCT with seven pipelined stages. In this figure, , , , and , the first stage performs by using shift operations with the Booth recoding algorithm. The pre-shuffle performs the pre-permutation matrix operation in the second stage. The matrix-vector multiplication is used to compute in the third and fourth stages. In the fifth stage, the post-shuffle performs the post-permutation matrix operation . The sixth and seventh stages perform with simple additions. With parallel-pipelined processing, both the subband-based DCT and IDCT algorithms need one multiplication operation, whereas the linear-array approach of [23] needs five.
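The shift-based stages rely on replacing a multiplication by a fixed constant with shifts and signed additions. A minimal Python sketch of radix-2 Booth recoding follows; the function names, the 16-bit width, and the example constant 23170 (roughly $2^{15}/\sqrt{2}$) are illustrative assumptions, not values taken from the hardware design.

```python
def booth_digits(constant: int, width: int) -> list:
    """Radix-2 Booth digits d_i = b_(i-1) - b_i of a two's-complement constant.

    `width` must be large enough to hold `constant` including its sign bit.
    """
    digits = []
    previous_bit = 0                      # b_(-1) is defined as 0
    for i in range(width):
        bit = (constant >> i) & 1
        digits.append(previous_bit - bit)
        previous_bit = bit
    return digits

def booth_multiply(x: int, constant: int, width: int = 16) -> int:
    """Multiply x by a fixed constant using only shifts, additions, and subtractions."""
    accumulator = 0
    for shift, digit in enumerate(booth_digits(constant, width)):
        if digit == 1:
            accumulator += x << shift
        elif digit == -1:
            accumulator -= x << shift
    return accumulator

scaled = booth_multiply(100, 23170, width=16)   # 100 * 23170 without a multiplier
```

Runs of identical bits in the constant produce zero digits, so a hardwired implementation needs an adder only where the bit pattern changes.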

Recall that the DCT of a signal, , can be represented as . The multiplication time of the unified subband-based algorithm can be derived as where .

The log plot of the subband-based DCT computations versus the number of DCT points is shown in Figure 3.

The multiplication time of the unified subband-based DCT with a single processor can be derived as follows [23]:

The left sides of Figures 4(a)–4(e) show the original Lena, Baboon, Barbara, Peppers, and Boat images, respectively. The reconstructed images, shown on the right sides of Figures 4(a)–4(e), are obtained by using the proposed subband-based 8-point DCT/IDCT algorithm with 32-bit fixed-point operands; the peak signal-to-noise ratios (PSNRs) of the Lena, Baboon, Barbara, Peppers, and Boat images are 149.67 dB, 142.12 dB, 143.08 dB, 143.36 dB, and 140.79 dB, respectively.

**Figure 4:** Original (left) and reconstructed (right) images: (a) Lena, (b) Baboon, (c) Barbara, (d) Peppers, (e) Boat.

The PSNR curves of the Lena, Baboon, Barbara, Peppers, and Boat images obtained by using the conventional 8-point DCT and the proposed subband-based 8-point DCT at various word lengths are shown in Figure 5. Figures 6(a)–6(e) show the PSNR curves of the same images obtained by using the conventional DCT and the proposed subband-based DCT with 32-bit operands at various numbers of DCT points. As one can see, the subband-based DCT is preferable.

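**Figure 5:** PSNR versus word length for the conventional and subband-based 8-point DCT: (a) Lena, (b) Baboon, (c) Barbara, (d) Peppers, (e) Boat.

**Figure 6:** PSNR versus number of DCT points with 32-bit operands: (a) Lena, (b) Baboon, (c) Barbara, (d) Peppers, (e) Boat.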
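The dependence of PSNR on word length can be reproduced in spirit with a small experiment. The Python sketch below is only an illustration under stated assumptions: it uses the conventional orthonormal 8-point DCT matrix (not the subband-based data flow), models fixed-point arithmetic by rounding coefficients to a given number of fractional bits, and runs on a random 64 × 64 test image rather than the images evaluated above. All function names are hypothetical.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix (assumed standard scaling)."""
    k = np.arange(n).reshape(-1, 1)
    m = np.arange(n).reshape(1, -1)
    t = np.sqrt(2.0 / n) * np.cos((2 * m + 1) * k * np.pi / (2 * n))
    t[0, :] = np.sqrt(1.0 / n)
    return t

def quantize(values, frac_bits):
    """Round to a fixed-point grid with `frac_bits` fractional bits."""
    step = 2.0 ** frac_bits
    return np.round(values * step) / step

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((original - reconstructed) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def roundtrip_psnr(image, frac_bits, n=8):
    """Blockwise DCT followed by IDCT with quantized matrix and coefficients."""
    t = quantize(dct_matrix(n), frac_bits)
    out = np.empty_like(image)
    for r in range(0, image.shape[0], n):
        for c in range(0, image.shape[1], n):
            block = image[r:r + n, c:c + n]
            coeffs = quantize(t @ block @ t.T, frac_bits)
            out[r:r + n, c:c + n] = t.T @ coeffs @ t
    return psnr(image, out)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64)).astype(float)   # stand-in test image
results = {bits: roundtrip_psnr(image, bits) for bits in (8, 12, 16, 24)}
```

In this model each additional fractional bit roughly halves the rounding error, so the PSNR in `results` grows with the word length, which is the trend the PSNR curves describe.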

#### 5. FPGA Implementation of the Reconfigurable Linear-Array DCT/IDCT Processor

The reconfigurable architecture of the fast 8-, 16-, 32- and 64-point DCT and IDCT processors based on the subband-based 8-point DCT is proposed in this section.

##### 5.1. The Proposed 8-Point DCT/IDCT Processor

Following the data flow of the subband-based 8-point DCT with six pipelined stages (Figure 1), the architecture of the proposed 8-point DCT processor is shown in Figure 7. The adder array (AA), with three CSA(4,2)s performing the matrix-vector multiplication of , is shown in Figure 8. Figure 9 shows the multiplier array (MA), which performs the three types of operation needed for the sub-coefficient matrix computation of . The control signals *swap* and *inv* determine the type of operation; the functions they select are listed in Table 1. Figure 10 shows the hardwired shifters used for performing by the Booth recoding algorithm [23]. Figure 11 shows the proposed 8-point IDCT processor with seven pipelined stages. The fast adder arrays, shuffle, multiplier array, CLA, and hardwired shifters of the DCT architecture can also be used for performing the IDCT. The latch array for retiming the input data is shown in Figure 12.

The hardware complexity of the proposed subband-based IDCT architecture is the same as that of the proposed subband-based DCT architecture. Figure 13 shows the proposed integrated 8-point DCT/IDCT processor.

##### 5.2. The Proposed Reconfigurable DCT/IDCT Processor

Based on the integrated 8-point DCT/IDCT processor (Figure 13), the proposed reconfigurable 8-, 16-, 32-, and 64-point DCT/IDCT processor is shown in Figure 14. The integrated adder array (IAA) for the fast computation of 8-, 16-, 32-, and 64-point DCT/IDCT is shown in Figure 15. The modified hardwired shifter (MHS) for multiplication by (where ) using the Booth recoding algorithm is shown in Figure 16.

Computation efficiency can be improved by increasing the number of multiplier arrays. The log plot of computation cycles versus the number of multiplier arrays is shown in Figure 17.

##### 5.3. FPGA Implementation of the Reconfigurable 2D DCT/IDCT Processor

The 2-D DCT is defined as [29] where for , and for . It can be rewritten as Thus, the separable 2-D DCT can be obtained by using the 1-D DCT as follows: Similarly, the separable 2-D IDCT can be obtained by using the 1-D IDCT as follows: As a result, the 2-D DCT/IDCT architecture can be implemented by using two successive 1-D DCT/IDCT processors with only one transpose memory [29]. The proposed architecture of the 2-D DCT and IDCT is shown in Figure 18. The control signals provided by the finite state machine (FSM) controller manage the data flow and the operation timing of the DCT/IDCT processors and the transpose memory; the transpose memory allows simultaneous read and write operations between the two processors while performing matrix transposition. The read/write timing diagram of the DCT/IDCT system is shown in Figure 19. In contrast to conventional 2-D DCT/IDCT architectures based on two transpose memories, the proposed architecture uses only one.
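The row-column decomposition can be sketched in a few lines of Python. The example assumes the standard orthonormal DCT-II matrix and a square block; the function names are illustrative, and the transpose between the two passes plays the role of the transpose memory.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix (assumed standard scaling)."""
    k = np.arange(n).reshape(-1, 1)
    m = np.arange(n).reshape(1, -1)
    t = np.sqrt(2.0 / n) * np.cos((2 * m + 1) * k * np.pi / (2 * n))
    t[0, :] = np.sqrt(1.0 / n)
    return t

def dct_1d_rows(block, t):
    """Apply the 1-D DCT to every row of `block`."""
    return block @ t.T

def dct_2d_row_column(block):
    """2-D DCT of a square block as two 1-D passes separated by a transpose."""
    t = dct_matrix(block.shape[0])
    stage1 = dct_1d_rows(block, t)        # first 1-D processor: transform the rows
    transposed = stage1.T                 # transpose memory
    stage2 = dct_1d_rows(transposed, t)   # second 1-D processor: former columns
    return stage2.T                       # restore the original orientation

X = np.arange(64, dtype=float).reshape(8, 8)   # arbitrary 8 x 8 test block
Y = dct_2d_row_column(X)
```

The result equals the direct matrix product `T @ X @ T.T`, and applying the transposed matrices in the same two-pass manner recovers `X`.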

A platform for architecture development and verification has been designed and implemented in order to evaluate the development cost. The architecture has been implemented on the Xilinx FPGA emulation board [30]. The Xilinx Spartan-3 FPGA has been integrated with a microcontroller (MCU) and an I/O interface circuit (USB 2.0) to form the architecture development and verification platform. Figure 20 depicts the block diagram and circuit board of this platform. The microcontroller reads data and commands from the PC and writes the results back to the PC over USB 2.0; the Xilinx Spartan-3 FPGA implements the proposed 2-D DCT/IDCT processor. The hardware description, written in Verilog, is developed on a PC with the ModelSim simulation tool [31] and the Xilinx ISE smart compiler [32]. It is noted that the throughput can be improved by using the proposed architecture while the computation accuracy is the same as that obtained by using the conventional one with the same word length. Thus, the proposed programmable DCT/IDCT architecture is able to improve power consumption and computation speed significantly. The proposed processor for 8-, 16-, 32-, and 64-point DCT/IDCT is an extension of the 8-point DCT/IDCT processor. Moreover, a reusable intellectual property (IP) DCT/IDCT core has also been implemented in Verilog for the hardware realization. All the control signals are generated internally on chip. The proposed DCT/IDCT processor provides both high throughput and low gate count.

#### 6. Conclusion

By exploiting the subband decomposition of a signal, a high-efficiency algorithm with pipelined stages has been proposed for fast DCT/IDCT computation. The proposed DCT/IDCT algorithm not only reduces computational complexity but also improves system performance. The PSNR and system complexity of the proposed algorithm are better than those of the previous algorithms [33–36]. Table 2 compares the proposed algorithm and architecture with other commonly used algorithms and architectures [24–28]. Thus, the proposed subband-based DCT/IDCT algorithm is suitable for real-time signal processing applications. The proposed DCT/IDCT processor provides both high throughput and low gate count and has been applied to various images with satisfactory results.

#### Acknowledgment

This work was supported by the National Science Council of Taiwan under Grants NSC98-2221-E-216-037 and NSC99-2221-E-239-034.