Research Article  Open Access
A Unified Algorithm for Subband-Based Discrete Cosine Transform
Abstract
Discrete cosine transform (DCT) and inverse DCT (IDCT) have been widely used in many image processing systems and in real-time computation of nonlinear time series. In this paper, a unified DCT/IDCT algorithm based on the subband decompositions of a signal is proposed. It is derived from the data flow of subband decompositions with factorized coefficient matrices in a recursive manner. The proposed algorithm only requires and multiplication time for point DCT and IDCT, with a single multiplier and a single processor, respectively. Moreover, the proposed algorithm achieves a higher peak signal-to-noise ratio (PSNR) than the conventional DCT/IDCT. As a result, the subband-based approach to DCT/IDCT is preferable to the conventional approach in terms of computational complexity and system performance. The proposed reconfigurable architecture of the linear-array DCT/IDCT processor has been implemented on an FPGA.
1. Introduction
The discrete cosine transform (DCT), first proposed by Ahmed et al. [1], is a Fourier-like transform. While the Fourier transform decomposes a signal into sine and cosine functions, the DCT makes use of cosine functions only and has the property of high energy compaction. As the DCT offers a good tradeoff between the optimal decorrelation of the Karhunen-Loève transform and computational simplicity [2], it has been extensively used in many applications [3–14]. In particular, two-dimensional (2D) DCT, such as DCT, has been adopted in international standards such as JPEG, MPEG, and H.264 [15]. In the MP3 audio codec, the subband analysis and synthesis filter banks require the use of a 32-point DCT/integer DCT to expedite computation [16]. Other audio compression standards, for example, the Dolby Digital AC-3 codec, utilize a modified DCT with 256 or 512 data points.
Many algorithms have been proposed for DCT/IDCT [17–21]. In these algorithms, the transformation matrix is factorized into products of simpler matrices. However, the factorized matrices are no longer as regular as those of the fast Fourier transform (FFT); thus, these algorithms can only achieve moderate computational speed. Specifically, the dedicated data paths deduced from the signal flow graphs (SFGs) of the above algorithms need to be optimized for performance enhancement, which is computationally intensive, and the custom-designed DCT is often complicated and cannot be easily scaled to a variable number of data points.
In this paper, we propose a novel linear-array architecture based on the subband decomposition of a signal for scalable DCT/IDCT. The remainder of this paper proceeds as follows. First, the subband-based 8-point DCT/IDCT algorithm [22] is reviewed in Section 2. Its extension to point DCT/IDCT, called the unified subband-based algorithm, is proposed in Section 3. Section 4 presents the analysis of system complexity. The reconfigurable architecture of the linear-array DCT/IDCT processor implemented on an FPGA (field programmable gate array) is proposed in Section 5, and the conclusion can be found in Section 6.
2. The Subband-Based 8-Point DCT/IDCT Algorithm
The discrete cosine transform (DCT) of an 8-point signal, , is defined as where , and for . It can be rewritten in the following matrix form: where , , and the transformation matrix is as follows: where , and .
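As an illustration of the matrix form, the following minimal sketch builds an 8-point DCT matrix and applies it to a signal. It assumes the common orthonormal DCT-II normalization (scale factor sqrt(1/N) for the first row and sqrt(2/N) for the others), which may differ from the scaling used in this paper; the names dct_matrix and mat_vec are illustrative only.

```python
import math


def dct_matrix(n=8):
    """Return the n x n DCT-II matrix (orthonormal scaling assumed) as a list of rows."""
    rows = []
    for k in range(n):
        # Assumed normalization: sqrt(1/n) for k = 0, sqrt(2/n) otherwise.
        c = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([c * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows


def mat_vec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


# A constant signal has all of its energy in the first (DC) coefficient.
x = [1.0] * 8
X = mat_vec(dct_matrix(8), x)
```

With this normalization the transform matrix is orthonormal, so the inverse transform is given by its transpose.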
Let and denote the low-frequency and high-frequency subband signals of , respectively [22], which can be obtained by where . As one can see, the DCT of can be rewritten as where and are the subband DCT and DST (discrete sine transform) of , respectively. Its vector form is as follows: where and denote the matrices of the subband DCT and DST, respectively, , and . According to (2.4), the 8-point matrix can be written as Due to the orthogonality between and , and can be obtained from by In [22], the multistage subband decomposition is as follows. and denote the DCTs of and , respectively, which can be obtained by and , where is the transformation matrix We have Similarly, the 4-point DCT computations of and can be obtained by for .
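The relation between the DCT of a signal and the subband DCT and DST can be checked numerically. The sketch below assumes an unnormalized DCT and the subband signals x_L(m) = (x(2m) + x(2m+1))/2 and x_H(m) = (x(2m) - x(2m+1))/2; under these assumptions the sum-to-product identities give C(k) = 2cos(kπ/2N)·C_L(k) + 2sin(kπ/2N)·S_H(k). The scaling conventions and function names are assumptions and may differ from those of [22].

```python
import math


def dct_direct(x):
    """Unnormalized N-point DCT: C(k) = sum_n x(n) cos((2n+1) k pi / (2N))."""
    n_pts = len(x)
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
                for n in range(n_pts))
            for k in range(n_pts)]


def dct_via_subbands(x):
    """Same DCT, computed from the half-length low and high subband signals."""
    n_pts = len(x)
    half = n_pts // 2
    # Assumed subband definitions (average and half-difference of sample pairs).
    x_low = [(x[2 * m] + x[2 * m + 1]) / 2 for m in range(half)]
    x_high = [(x[2 * m] - x[2 * m + 1]) / 2 for m in range(half)]
    out = []
    for k in range(n_pts):
        # Subband DCT of x_low and subband DST of x_high, evaluated at index k.
        c_low = sum(x_low[m] * math.cos((2 * m + 1) * k * math.pi / (2 * half))
                    for m in range(half))
        s_high = sum(x_high[m] * math.sin((2 * m + 1) * k * math.pi / (2 * half))
                     for m in range(half))
        theta = k * math.pi / (2 * n_pts)
        out.append(2 * math.cos(theta) * c_low + 2 * math.sin(theta) * s_high)
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
direct = dct_direct(signal)
subband = dct_via_subbands(signal)
```

The two results agree to within floating-point rounding, which is the property the multistage decomposition exploits when it is applied again to the 4-point and 2-point subbands.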
The 4-point distributed matrix , can be defined as Let , , , and be the 2-point DCTs of , , , and , respectively, which can be computed by using the following 2-point transformation matrix, : We have For the 2-point DCT computations of , , , and , let where , we then have is a 2-point matrix defined as Finally, according to equations (2.6)–(2.33) together with (2.4), we have where The following matrix, , can be derived from (2.12)–(2.15). Similarly, the following matrix, , can be derived from (2.21)–(2.28). The decomposition matrix can be defined as and the coefficient matrix can be defined as According to (2.34), (2.38), and (2.39), we have The coefficient matrix can be represented by the reordered coefficient matrix , pre-permutation matrix and post-permutation matrix , and can be written as where the matrices , , and can be defined as The reordered coefficient matrix can be represented as The computation of sub-coefficient matrix can be written as where and . The above can be rewritten as [20] The inverse reordered coefficient matrix can be represented as As a result, the total number of multiplications of the subband-based 8-point IDCT is only 15.
3. The Unified Subband-Based Point DCT/IDCT Algorithm
The subband-based DCT algorithm [22] can be unified for point DCT/IDCT due to its inherently regular pattern. For an point signal, , the unified subband-based discrete cosine transform can be defined as where , is the decomposition matrix, is the reordered coefficient matrix, is the pre-permutation matrix, and is the post-permutation matrix. The unified decomposition matrix can be written as where the basic matrix, , consists of two submatrices, and As noted, can be represented by the submatrices of , or submatrices, and as follows:
According to (3.1) and (3.2), the unified distributed matrix can be derived as According to (2.2) and (2.40), we have The 4-point reordered coefficient matrix can be derived as According to (3.5), the 8-point reordered coefficient matrix can be derived as Hence, the unified reordered coefficient matrix can be written as The 4-point pre-permutation matrix and post-permutation matrix can be written as According to (2.50), (2.51), (3.8), and (3.9), the 8-point pre- and post-permutation matrices can be written as Hence, the unified pre- and post-permutation matrices can be represented as where denotes .
According to (2.47) and (2.53), the unified subband-based IDCT can be obtained as where is the inverse decomposition matrix, is the inverse coefficient matrix, is the inverse pre-permutation matrix, and is the inverse post-permutation matrix. Note that the decomposition matrix, the reordered coefficient matrix, and the pre- and post-permutation matrices are all orthonormal. Hence, we have
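The practical consequence of orthonormality is that each inverse factor is simply a transpose, and the inverse of a product is the product of the transposes in reverse order. The sketch below illustrates this with two small hypothetical factors, a normalized sum/difference butterfly and a permutation; these are stand-ins chosen for brevity, not the matrices of this paper.

```python
import math


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    b_cols = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in b_cols]
            for row in a]


s = 1.0 / math.sqrt(2.0)
# Hypothetical orthonormal factors (illustrative only).
butterfly = [[s, s, 0.0, 0.0],
             [s, -s, 0.0, 0.0],
             [0.0, 0.0, s, s],
             [0.0, 0.0, s, -s]]
perm = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0]]

forward = mat_mul(perm, butterfly)
# Inverse of a product of orthonormal factors: transposes in reverse order.
inverse = mat_mul(transpose(butterfly), transpose(perm))
product = mat_mul(inverse, forward)  # identity up to rounding
```

In hardware terms, this means the inverse transform can reuse the same adder, shuffle, and multiplier structures with the data flow reversed.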
4. Analysis of Computation Complexity and System Performance
Based on the 8-point subband-based DCT and IDCT algorithm, the data flows of parallel-pipelined processing for the 8-point DCT and IDCT are described as follows. The data flow of the subband-based 8-point DCT with six pipelined stages is shown in Figure 1. Here, , and , the matrix-vector multiplication of in the first stage, takes one simple-addition time for each element of . The pre-shuffle performs the pre-permutation matrix operation in the second stage. The matrix-vector multiplication is used to compute in the third and fourth stages. In the fifth stage, the post-shuffle is used for the post-permutation matrix . The final stage is to compute by using a simple shift operation with the Booth recoded algorithm.
Similarly, Figure 2 shows the data flow of the subband-based 8-point IDCT with seven pipelined stages.
Here, , , , and , it performs by using a shift operation with the Booth recoded algorithm in the first stage. The pre-shuffle performs the pre-permutation matrix operation in the second stage. The matrix-vector multiplication is used to compute in the third and fourth stages. In the fifth stage, the post-shuffle performs the post-permutation matrix . The sixth and seventh stages perform with simple addition. Both the subband-based DCT and IDCT algorithms need one multiplication operation with parallel-pipelined processing, whereas the linear-array design in [23] needs five multiplication operations.
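The shift-based scaling stage can be illustrated with a generic radix-2 Booth recoding, in which a constant multiplier is rewritten as signed digits in {-1, 0, +1} so that the product needs only shifts, additions, and subtractions. This is a minimal software sketch of the general technique, not the hardwired shifter of this paper; the function names, the 16-bit word length, and the Q15 constant 1/(2·sqrt(2)) are assumptions made for illustration.

```python
def booth_digits(y, bits=16):
    """Radix-2 Booth digits (LSB first) of a two's-complement integer y.

    Digit i equals y[i-1] - y[i], with y[-1] taken as 0. The input must fit
    in `bits` bits, i.e. -2**(bits-1) <= y < 2**(bits-1).
    """
    u = y & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0
    for i in range(bits):
        cur = (u >> i) & 1
        digits.append(prev - cur)
        prev = cur
    return digits


def booth_multiply(x, y, bits=16):
    """Compute x * y using only shifts, additions, and subtractions."""
    acc = 0
    for i, d in enumerate(booth_digits(y, bits)):
        if d == 1:
            acc += x << i
        elif d == -1:
            acc -= x << i
    return acc


# Hypothetical fixed-point scale factor: 1/(2*sqrt(2)) in Q15 format.
SCALE_Q15 = round((1 / (2 * 2 ** 0.5)) * (1 << 15))
scaled = booth_multiply(100, SCALE_Q15) >> 15  # integer part of 100 * 0.3535...
```

Because the scale factors are fixed constants, the pattern of shifts and add/subtract operations is known at design time, which is what allows a hardwired realization without a general-purpose multiplier.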
Recall that the DCT of a signal, , can be represented as . The multiplication time of the unified subband-based algorithm can be derived as where .
The log plot of the subband-based DCT computations versus the number of DCT points is shown in Figure 3.
The multiplication time of the unified subband-based DCT with a single processor can be derived as follows [23]:
The left sides of Figures 4(a), 4(b), 4(c), 4(d), and 4(e) show the original Lena, Baboon, Barbara, Peppers, and Boat images, respectively. The reconstructed Lena, Baboon, Barbara, Peppers, and Boat images, shown on the right sides of Figures 4(a), 4(b), 4(c), 4(d), and 4(e), respectively, are obtained by using the proposed subband-based 8-point DCT/IDCT algorithm with 32-bit fixed-point operands; the peak signal-to-noise ratios (PSNRs) of the Lena, Baboon, Barbara, Peppers, and Boat images are 149.67 dB, 142.12 dB, 143.08 dB, 143.36 dB, and 140.79 dB, respectively.
Figure 4: (a) Lena; (b) Baboon; (c) Barbara; (d) Peppers; (e) Boat.
The PSNR curves of the Lena, Baboon, Barbara, Peppers, and Boat images obtained by using the conventional 8-point DCT and the proposed subband-based 8-point DCT at various word lengths are shown in Figure 5. Figures 6(a), 6(b), 6(c), 6(d), and 6(e) show the PSNR curves of the Lena, Baboon, Barbara, Peppers, and Boat images obtained by using the conventional DCT and the proposed subband-based DCT with 32-bit operands at various DCT points. As one can see, the subband-based DCT is preferable.
Figure 5: (a) Lena; (b) Baboon; (c) Barbara; (d) Peppers; (e) Boat.
Figure 6: (a) Lena; (b) Baboon; (c) Barbara; (d) Peppers; (e) Boat.
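For reference, the PSNR between an original and a reconstructed image is conventionally computed from the mean squared error as 10·log10(peak²/MSE). The sketch below assumes 8-bit images with a peak value of 255 and images supplied as flat sequences of pixel values; the function name psnr and these conventions are assumptions, since the exact evaluation setup is not given in the text.

```python
import math


def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(reconstructed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


# Two pixels, one of which differs by a single gray level.
value = psnr([0, 255], [0, 254])
```

Under this definition, a longer word length reduces the rounding error of the DCT/IDCT round trip, lowers the MSE, and therefore raises the PSNR.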
5. FPGA Implementation of the Reconfigurable LinearArray DCT/IDCT Processor
The reconfigurable architecture of the fast 8-, 16-, 32-, and 64-point DCT and IDCT processors based on the subband-based 8-point DCT is proposed in this section.
5.1. The Proposed 8-Point DCT/IDCT Processor
According to the data flow of the subband-based 8-point DCT with six pipelined stages (Figure 1), the architecture of the proposed 8-point DCT processor is shown in Figure 7. Here, the adder array (AA) with three CSA(4,2)s performing the matrix-vector multiplication of is shown in Figure 8. Figure 9 shows the multiplier array (MA) performing three types of operation, which are needed to compute the sub-coefficient matrix computation of . The control signals swap and inv determine the type of operation. The functions determined by swap and inv are shown in Table 1. Figure 10 shows the hardwired shifters used for performing by the Booth recoded algorithm [23]. Figure 11 shows the proposed 8-point IDCT processor with seven pipelined stages. Here, the fast adder arrays, shuffle, multiplier array, CLA, and hardwired shifters of the DCT architecture can also be used for performing the IDCT. The latch array for retiming the input data is shown in Figure 12.

The hardware complexity of the proposed subband-based IDCT architecture is the same as that of the proposed subband-based DCT architecture. Figure 13 shows the proposed integrated 8-point DCT/IDCT processor.
5.2. The Proposed Reconfigurable DCT/IDCT Processor
According to the integrated 8-point DCT/IDCT processor (Figure 13), the proposed reconfigurable 8-, 16-, 32-, and 64-point DCT/IDCT processor is shown in Figure 14. Here, the integrated adder array (IAA) for the fast computation of the 8-, 16-, 32-, and 64-point DCT/IDCT is shown in Figure 15. The modified hardwired shifter (MHS) for multiplication by (where ) using the Booth recoded algorithm is shown in Figure 16.
In order to improve the computation efficiency, the number of multiplier arrays should be increased. The log plot of computation cycles versus number of multiplier arrays is shown in Figure 17.
5.3. FPGA Implementation of the Reconfigurable 2D DCT/IDCT Processor
The DCT is defined as [29] where for , and for . It can be rewritten as Thus, the separable 2D DCT can be obtained by using 1D DCT as follows: Similarly, the separable 2D IDCT can be obtained by using 1D IDCT as follows: As a result, the architecture of the 2D DCT/IDCT can be implemented by using two successive 1D DCT/IDCT processors with only one transpose memory [29]. The proposed architecture of the 2D DCT and IDCT is shown in Figure 18. Here, the control signals provided by the finite state machine (FSM) controller are used to manage the data flow and the operation timing for the DCT/IDCT and the transpose memory; the transpose memory allows simultaneous read and write operations between the two processors while performing matrix transposition. The data read/write timing diagram for the DCT/IDCT system is shown in Figure 19. In comparison with conventional 2D DCT/IDCT architectures based on two transpose memories, the proposed architecture utilizes only one transpose memory.
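The row-column evaluation of the separable 2D DCT can be sketched in software as a 1D pass over the rows, a transposition, a second 1D pass, and a final transposition. The example assumes the orthonormal DCT-II scaling and a small square block; the function names are illustrative, and the direct double-sum version is included only to check the separable result.

```python
import math


def dct_1d(x):
    """Orthonormal 1D DCT-II of a sequence (normalization assumed)."""
    n = len(x)
    out = []
    for k in range(n):
        c = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(c * sum(x[j] * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                           for j in range(n)))
    return out


def transpose(m):
    """Software stand-in for the role of the transpose memory."""
    return [list(col) for col in zip(*m)]


def dct_2d(block):
    """Separable 2D DCT: 1D DCT on rows, transpose, 1D DCT again, transpose."""
    row_pass = [dct_1d(row) for row in block]
    col_pass = [dct_1d(col) for col in transpose(row_pass)]
    return transpose(col_pass)


def dct_2d_direct(block):
    """Direct double-sum 2D DCT of a square block, used only as a reference."""
    n = len(block)
    scale = lambda k: math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    return [[scale(u) * scale(v) * sum(
                block[i][j]
                * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                * math.cos((2 * j + 1) * v * math.pi / (2 * n))
                for i in range(n) for j in range(n))
             for v in range(n)]
            for u in range(n)]


block = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
separable = dct_2d(block)
reference = dct_2d_direct(block)
```

The separable form needs 2N one-dimensional transforms of length N instead of an N²-term sum per output coefficient, which is why two 1D processors and a transpose memory suffice for the 2D transform.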
The platform for architecture development and verification has been designed and implemented in order to evaluate the development cost. The architecture has been implemented on the Xilinx FPGA emulation board [30]. The Xilinx Spartan-3 FPGA has been integrated with the microcontroller (MCU) and I/O interface circuit (USB 2.0) to form the architecture development and verification platform. Figure 20 depicts the block diagram and circuit board of the architecture development and evaluation platform. Here, the microcontroller reads data and commands from the PC and writes the results back to the PC over USB 2.0; the Xilinx Spartan-3 FPGA implements the proposed 2D DCT/IDCT processor. The hardware description, written in Verilog, is developed on a PC with the ModelSim simulation tool [31] and the Xilinx ISE smart compiler [32]. It is noted that the throughput can be improved by using the proposed architecture while the computation accuracy is the same as that obtained by using the conventional one with the same word length. Thus, the proposed programmable DCT/IDCT architecture is able to improve the power consumption and computation speed significantly. The proposed processor for 8-, 16-, 32-, and 64-point DCT/IDCT is an extension of the 8-point DCT/IDCT processor. Moreover, the reusable intellectual property (IP) DCT/IDCT core has also been implemented in Verilog for the hardware realization. All the control signals are internally generated on chip. The proposed DCT/IDCT processor provides both high throughput and low gate count.
6. Conclusion
With the advantages of the subband decomposition of a signal, a high-efficiency algorithm with pipelined stages has been proposed for fast DCT/IDCT computations. It is noted that the proposed DCT/IDCT algorithm not only reduces computational complexity but also improves system performance. The PSNR and system complexity of the proposed algorithm are better than those of the previous algorithms [33–36]. Table 2 shows comparisons between the proposed algorithm and architecture and other commonly used algorithms and architectures [24–28]. Thus, the proposed subband-based DCT/IDCT algorithm is suitable for real-time signal processing applications. The proposed DCT/IDCT processor provides both high throughput and low gate count and has been applied to various images with satisfactory results.