Tze-Yun Sung, Yaw-Shih Shieh, Hsi-Chin Hsin, "An Efficient VLSI Linear Array for DCT/IDCT Using Subband Decomposition Algorithm", Mathematical Problems in Engineering, vol. 2010, Article ID 185398, 21 pages, 2010. https://doi.org/10.1155/2010/185398
An Efficient VLSI Linear Array for DCT/IDCT Using Subband Decomposition Algorithm
Abstract
The discrete cosine transform (DCT) and inverse DCT (IDCT) have been widely used in many image processing systems and in real-time computation of nonlinear time series. In this paper, a novel linear array for DCT and IDCT is derived from the data flow of subband decompositions representing the factorized coefficient matrices in the matrix formulation of the recursive algorithm. To increase the throughput and decrease the hardware cost, the input and output data are reordered. The proposed 8-point DCT/IDCT processor, which uses four multipliers, simple adders, and fewer registers and less ROM for storing the intermediate results and coefficients, respectively, has been implemented on FPGA (field-programmable gate array) and SoC (system on chip). The linear-array DCT/IDCT processor with the computation complexity and hardware complexity is fully pipelined and scalable for variable-length DCT/IDCT computations.
1. Introduction
With the rapid growth of modern communication applications and computer technologies, image compression and real-time computation of nonlinear time series continue to be in great demand. The discrete cosine transform (DCT) is one of the major operations in various image/video compression standards [1] and nonlinear time series applications [2–8]. Though the fast Fourier transform (FFT) can be used to implement DCT, it requires complex-valued computations; moreover, N-point DCT by FFT contains stages. The conventional DCT architectures using distributed arithmetic involve complex hardware with a great number of registers [9–19]. Other commonly used DCT architectures with matrix formulation and distributed memory [20–27] are, however, not suited for VLSI implementation because the hardware complexity is proportional to the length of the DCT, which leads to a scalability problem for variable-length DCT computations. In this paper, we propose a novel linear-array architecture for scalable DCT/IDCT implementation.
The remainder of this paper proceeds as follows. In Section 2, we propose the fast DCT/IDCT computation based on the subband decomposition algorithm. In Section 3, reconfigurable FPGA-based and programmable SoC implementations with low hardware cost are proposed for the fast DCT/IDCT computation. The performance comparison and conclusions can be found in Section 4.
2. Proposed Fast DCT/IDCT Computation
For an N-point signal, , the discrete cosine transform (DCT) [28] is defined as where , and for . Let and denote the low-frequency and high-frequency subband signals of , respectively, which are defined as where . The original signal can be obtained from and as follows: As one can see, the DCT of can be rewritten as where and are the subband DCT and DST (discrete sine transform) of , respectively.
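The subband relation described above can be checked numerically. The following is a minimal sketch, not the paper's exact formulation: it assumes an unnormalized DCT kernel and assumes the subbands are the half-sum and half-difference of adjacent sample pairs (the usual convention for this kind of decomposition). The function names `dct_unnormalized`, `subband_split`, and `dct_via_subbands` are hypothetical.

```python
import math

def dct_unnormalized(x):
    """Direct N-point DCT-II sum, with no scale factors (an assumption)."""
    n_pts = len(x)
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
                for n in range(n_pts))
            for k in range(n_pts)]

def subband_split(x):
    """Assumed subband definition: half-sum (low) and half-difference (high)
    of each adjacent pair x[2m], x[2m+1]."""
    half = len(x) // 2
    low = [(x[2 * m] + x[2 * m + 1]) / 2 for m in range(half)]
    high = [(x[2 * m] - x[2 * m + 1]) / 2 for m in range(half)]
    return low, high

def dct_via_subbands(x):
    """Same DCT, computed from a cosine sum over the low band and a sine sum
    over the high band, each weighted by a half-angle factor."""
    n_pts = len(x)
    half = n_pts // 2
    low, high = subband_split(x)
    out = []
    for k in range(n_pts):
        c_low = sum(low[m] * math.cos((2 * m + 1) * k * math.pi / n_pts)
                    for m in range(half))
        s_high = sum(high[m] * math.sin((2 * m + 1) * k * math.pi / n_pts)
                     for m in range(half))
        delta = k * math.pi / (2 * n_pts)
        out.append(2 * math.cos(delta) * c_low + 2 * math.sin(delta) * s_high)
    return out
```

Under these assumptions the two routines agree to rounding error for any even-length input, which follows from expanding cos(θ − δ) and cos(θ + δ) for each sample pair. The scale factors of the normalized DCT are left out on purpose, since they only multiply each output coefficient.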
2.1. Fast DCT Computation Based on Subband Decomposition Algorithm
Without loss of generality, the 8-point fast DCT based on the subband decomposition algorithm is proposed for the widely used JPEG and MPEG-1/2 standards, which can be easily extended to variable-length DCT computations. The vector form of the 8-point DCT can be written as where , , , and and denote the matrices of subband DCT and subband DST, respectively, which can form orthonormal bases for the two orthogonal subspaces of . Notice that, due to the orthogonality between and , and can be obtained from as follows: where , and .
The proposed fast DCT algorithm is a subband decomposition-based multistage algorithm. Specifically, let where . And let where . Based on subband decompositions using (2.2), (2.7), and (2.8), the data flow of computing the 2-point subband DCT: and subband DST: for the 8-point DCT is shown in Figure 1. As one can see, the data flow of computing and can be obtained in a similar way, and therefore is not shown in Figure 1. All of the 2-point subband DCTs and DSTs are given by Thus, we have where is the original signal, and
Similarly, we have the following: Figure 2 depicts the relationship between and , which can be obtained by the following: According to (2.24)–(2.27), we have Finally, the proposed 8-point DCT computation based on subband decomposition is as follows: where Figure 8 shows the block diagram of the proposed DCT computation; one of the advantages is that is orthogonal, and all of the submatrices of are orthonormal.
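The multistage decomposition described in this subsection reduces an 8-point signal to 2-point subband signals by repeating the low/high split. The sketch below illustrates only that splitting and its inverse, under the assumption that each stage uses the half-sum and half-difference of adjacent pairs. The band labels ("LL", "LH", and so on) and the function names are hypothetical and are not taken from the paper.

```python
def split(x):
    """One stage: half-sum (low) and half-difference (high) of adjacent pairs."""
    half = len(x) // 2
    low = [(x[2 * m] + x[2 * m + 1]) / 2 for m in range(half)]
    high = [(x[2 * m] - x[2 * m + 1]) / 2 for m in range(half)]
    return low, high

def merge(low, high):
    """Inverse of split: x[2m] = low + high, x[2m+1] = low - high."""
    out = []
    for lo_val, hi_val in zip(low, high):
        out.extend([lo_val + hi_val, lo_val - hi_val])
    return out

def multistage_split(x, min_len=2):
    """Split every band repeatedly until each band has min_len samples.
    Returns a dict mapping a label such as 'LH' (high band of the low band)
    to its samples. Assumes len(x) is min_len times a power of two."""
    bands = {"": list(x)}
    while len(next(iter(bands.values()))) > min_len:
        next_bands = {}
        for label, band in bands.items():
            low, high = split(band)
            next_bands[label + "L"] = low
            next_bands[label + "H"] = high
        bands = next_bands
    return bands
```

For an 8-point input this yields four 2-point bands after two stages. Applying `merge` to the band pairs in reverse order recovers the original samples, so no information is lost by the decomposition.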
2.2. Fast IDCT Computation Based on Subband Decomposition Algorithm
According to (2.29), IDCT can be obtained by where As is orthogonal and all of the submatrices of are orthonormal, the inverse of and can be obtained easily. In addition, it takes only twenty multiplication operations for both DCT and IDCT.
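The remark that orthogonal factors are easy to invert can be illustrated with the plain orthonormal DCT matrix, whose inverse is its transpose. This is a sketch of that general property only; it does not reproduce the paper's factorized matrices, and the helper names (`dct_matrix`, `matvec`, `transpose`, `dct`, `idct`) are hypothetical.

```python
import math

def dct_matrix(n_pts=8):
    """Orthonormal DCT-II matrix: row k, column n."""
    rows = []
    for k in range(n_pts):
        scale = math.sqrt(1 / n_pts) if k == 0 else math.sqrt(2 / n_pts)
        rows.append([scale * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
                     for n in range(n_pts)])
    return rows

def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def dct(x):
    return matvec(dct_matrix(len(x)), x)

def idct(coeffs):
    # For an orthonormal matrix the inverse is the transpose,
    # so no matrix inversion is needed.
    return matvec(transpose(dct_matrix(len(coeffs))), coeffs)
```

A round trip `idct(dct(x))` returns `x` up to floating-point rounding. The same reasoning applies factor by factor when a transform is written as a product of orthogonal matrices: the inverse is the product of the transposes in reverse order.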
3. VLSI Implementation of an Efficient Linear-Array DCT/IDCT Processor
Based on the proposed approach to fast DCT computation shown in Figure 8, an efficient architecture for implementing the fast DCT/IDCT processor is presented in this section. Recall that the DCT of a signal, , can be efficiently obtained by . Let , then we have . Figure 9 shows the matrix-vector multiplication of , in which six CSA(3,2)s (carry-save adder (3,2)) and one CSA (carry-save adder) [29, 30] are utilized, and therefore four simple-addition times and one CSA computation time are required to compute each element of . Figures 10 and 11 show the multiplier array (MA) consisting of four multipliers and the CSA array (CA) consisting of eight CSAs, respectively, which are used to compute the matrix-vector computation of ; thus, only one multiplication time with one CSA computation time is needed to compute each element of , that is, the DCT coefficient. Table 3 depicts the data flow of the proposed fast DCT processor with pipelined linear-array architecture [31]. As a result, only five multiplication cycles with five addition cycles are needed to compute the 8-point DCT. In general, for N-point DCT, the computation time and hardware complexity of the proposed fast DCT processor are and , respectively.
Table 4 shows the data flow of the proposed fast IDCT algorithm [31], where is the DCT of an 8-point signal ; , and . Figure 12 shows the so-called full CSA(4,2) (FCSA(4,2)) consisting of two CSA(3,2)s and one CSA for the computation of [29, 30]. It is noted that the CSA array consisting of eight CSAs shown in Figure 11 can also be used for the computation of . As shown in Table 4, only five multiplication cycles with three addition cycles are needed to compute the 8-point IDCT. As one can see, the computation time and hardware complexity of the proposed fast IDCT architecture are the same as those of the proposed fast DCT architecture. In addition, only 16-word RAM/registers and 10-word ROM are required to store the intermediate results and constants, respectively, and the latency is only five multiplication cycles.
Figure 13 shows the system block diagram of the proposed fast DCT/IDCT architecture. A platform for architecture development and verification has been designed and implemented in order to evaluate the development cost. Figure 14 depicts the block diagram of the platform, in which the 8051 microcontroller reads data from the PC via a DMA channel and writes the result back to the PC over a USB 2.0 bus; the Xilinx XC2V6000 FPGA chip implements the proposed DCT processor [32]. The architecture development and verification board shown in Figure 15 is used to verify and evaluate the proposed DCT/IDCT architecture. Moreover, the reusable intellectual property (IP) DCT/IDCT core has also been implemented in MATLAB for functional simulations. The hardware code written in Verilog is run on a workstation with the ModelSim simulation tool and the Xilinx ISE smart compiler. In addition, the FPGA platform shown in Figure 14 is used to verify and evaluate the proposed DCT architecture. It is noted that the throughput can be improved by using the proposed architecture, while the computation accuracy is the same as that obtained by using the conventional one with the same word length.
The SoC is synthesized with the TSMC 0.18 μm 1P6M CMOS cell libraries [33]. The physical circuit is synthesized by the Astro tool. The circuit is evaluated by DRC, LVS, and PVS [34]. Figure 16 shows the cell-based design flow. The layout view of the 8-point DCT/IDCT processor with 32-bit operands is shown in Figure 17. The core areas are obtained by the Synopsys design analyzer. The power consumption figures are obtained by PrimePower. The reported core size of the implemented processor is and the power dissipation is 102.2 mW at 1.8 V with a clock rate of 1 GHz. Thus, the proposed programmable DCT/IDCT architecture is able to improve the power consumption and computation speed significantly. All the control signals are internally generated on-chip. The proposed DCT/IDCT processor provides both high throughput and low gate count.
The proposed reconfigurable DCT/IDCT processor used to compute N-point DCT/IDCT on FPGA is composed mainly of the 8-point DCT/IDCT core; when a single 8-point DCT/IDCT core is extended to N-point DCT/IDCT computation, the computation complexity is O(5N/8). Note that the transform matrices used for the proposed linear array with the 8-point DCT core can be extended to a variety of different sizes. Thus, the proposed architecture is highly scalable.
The linear-array architecture with efficient use of hardware resources has been proposed for trade-offs among performance, chip area, and power consumption. As a result, it has the advantage of balancing the need for power saving with computation speed.
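As background for the CSA(3,2) blocks mentioned above, the following sketch models a carry-save (3,2) step on non-negative integers and uses it to sum a list of operands. It illustrates the general technique only; the wiring of Figure 9 is not reproduced, the final carry-propagate addition is modeled with ordinary integer addition as an assumption, and the names `csa_3_2` and `add_many` are hypothetical.

```python
def csa_3_2(a, b, c):
    """Carry-save (3,2) step on non-negative integers.
    Returns (sum_word, carry_word) with sum_word + carry_word == a + b + c.
    No carry propagates between bit positions inside this step."""
    sum_word = a ^ b ^ c                           # bitwise sum without carries
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # majority bits, shifted left
    return sum_word, carry_word

def add_many(operands):
    """Reduce the operands to two words with CSA(3,2) steps, then perform one
    ordinary (carry-propagate) addition. Returns (total, number_of_csa_steps)."""
    pending = list(operands)
    steps = 0
    while len(pending) > 2:
        a, b, c = pending[:3]
        pending = list(csa_3_2(a, b, c)) + pending[3:]
        steps += 1
    return sum(pending), steps
```

Each (3,2) step removes one operand, so reducing eight operands to two takes six steps followed by a single final addition. That count is at least consistent with the six CSA(3,2)s and one final adder described for Figure 9, although the actual structure in the figure may differ.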
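The O(5N/8) figure can be read against the five multiplication cycles quoted earlier for one 8-point transform. The short sketch below simply evaluates 5N/8 for a few transform lengths; it assumes N is a multiple of 8 and does not model how the 8-point core is actually reused. The function name and its parameters are hypothetical.

```python
def multiplication_cycles(n_points, core_size=8, cycles_per_pass=5):
    """Evaluate cycles_per_pass * n_points / core_size (5N/8 by default).
    Assumes n_points is a positive multiple of core_size."""
    if n_points <= 0 or n_points % core_size != 0:
        raise ValueError("n_points must be a positive multiple of core_size")
    return cycles_per_pass * (n_points // core_size)

# Growth of the estimate with transform length.
estimates = {n: multiplication_cycles(n) for n in (8, 16, 32, 64)}
```

For N = 8 the expression gives 5, matching the five multiplication cycles stated for the 8-point DCT, and it grows linearly with N.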
4. Conclusion
By taking advantage of subband decomposition, a high-efficiency architecture with pipelined structures is proposed for fast DCT/IDCT computation. Specifically, the proposed DCT/IDCT architecture not only improves throughput to more than twice that of the conventional architectures [9–11, 15–19], but also saves memory space significantly [1, 9–22]. Table 1 shows comparisons between the proposed architecture and the conventional architectures [1, 9–14] (with dual memory banks) and [15–19]. Table 2 shows comparisons with other commonly used architectures [1, 12–14, 20–24]. For DCT, the algorithm proposed by Feig requires 54 multiplications and 462 additions [27]; the proposed method requires 25 multiplications and 100 additions. Thus, in terms of operation count, this work compares favorably with the Feig algorithm. In addition, the proposed fast DCT/IDCT architecture is highly regular, scalable, and flexible. The DCT/IDCT processor, designed in portable Verilog, is a reusable IP that can be implemented in various processes; combined with efficient use of hardware resources for trade-offs among performance, area, and power consumption, it is well suited to JPEG and MPEG-1/2 applications.




Acknowledgments
The National Science Council of Taiwan, Taipei, Taiwan, under Grant NSC982221E216037 and the Chung Hua University, Hsinchu, Taiwan, under Grant no. CHUNSC982221E216037 supported this work.
References
[1] T.-Y. Sung, “Memory-efficient and high-performance 2-D DCT and IDCT processors based on CORDIC rotation,” WSEAS Transactions on Electronics, vol. 3, no. 12, pp. 565–574, 2006.
[2] M. Li and W. Zhao, “Representation of a stochastic traffic bound,” IEEE Transactions on Parallel and Distributed Systems, preprint.
[3] M. Li, “Fractal time series—a tutorial review,” Mathematical Problems in Engineering, vol. 2010, Article ID 157264, 26 pages, 2010.
[4] M. Li and S. C. Lim, “Modeling network traffic using generalized Cauchy process,” Physica A, vol. 387, no. 11, pp. 2584–2594, 2008.
[5] C. Cattani, “Harmonic wavelet approximation of random, fractal and high frequency signals,” Telecommunication Systems, vol. 43, no. 3-4, pp. 207–217, 2010.
[6] E. G. Bakhoum and C. Toma, “Mathematical transform of traveling-wave equations and phase aspects of quantum interaction,” Mathematical Problems in Engineering, vol. 2010, Article ID 695208, 15 pages, 2010.
[7] M. Li, “Generation of teletraffic of generalized Cauchy type,” Physica Scripta, vol. 81, no. 2, Article ID 025007, 2010.
[8] M. Li and J.-Y. Li, “On the predictability of long-range dependent series,” Mathematical Problems in Engineering, vol. 2010, Article ID 397454, 9 pages, 2010.
[9] T.-Y. Sung, “VLSI parallel and distributed computation algorithms for DCT processors,” in Proceedings of the IEEE International Phoenix Conference on Computer and Communications, pp. 121–125, Scottsdale, Ariz, USA, 1990.
[10] T.-Y. Sung, “VLSI parallel and distributed processing algorithms for multidimensional discrete cosine transforms,” in Proceedings of the Two-Track International Conference on Databases, Parallel Architectures, and Their Applications, pp. 36–39, Miami Beach, Fla, USA, March 1990.
[11] T.-Y. Sung, “Novel parallel VLSI architectures for discrete cosine transforms,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 998–1001, Albuquerque, New Mexico, USA, April 1990.
[12] T.-Y. Sung and Y.-H. Sung, “A novel implementation of cost-effective parallel-pipelined $8\times 8$ DCT processor,” in Proceedings of the 4th IEEE Asia-Pacific Conference on Advanced System Integrated Circuits (AP-ASIC '04), pp. 200–203, Fukuoka, Japan, August 2004.
[13] T.-Y. Sung, Y.-S. Shieh, and H.-C. Hsin, “Memory efficiency and high-speed architectures for forward and inverse DCT with multiplierless operation,” in Proceedings of the Advances in Image and Video Technology, vol. 4319 of Lecture Notes in Computer Science, pp. 802–811, Springer, Berlin, Germany, December 2006.
[14] T.-Y. Sung, Y.-S. Shieh, and H.-C. Hsin, “High-efficiency and low-power architectures for 2-D DCT and IDCT based on CORDIC rotation,” in Proceedings of the 7th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT '06), pp. 191–196, December 2006.
[15] Y. H. Hu and Z. Wu, “An efficient CORDIC array structure for the implementation of discrete cosine transform,” IEEE Transactions on Signal Processing, vol. 43, no. 1, pp. 331–336, 1995.
[16] H. Jeong, J. Kim, and W.-K. Cho, “Low-power multiplierless DCT architecture using image data correlation,” IEEE Transactions on Consumer Electronics, vol. 50, no. 1, pp. 262–267, 2004.
[17] D. Gong, Y. He, and Z. Gao, “New cost-effective VLSI implementation of a 2-D discrete cosine transform and its inverse,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 4, pp. 405–415, 2004.
[18] V. Dimitrov, K. Wahid, and G. Jullien, “Multiplication-free $8\times 8$ 2-D DCT architecture using algebraic integer encoding,” Electronics Letters, vol. 40, no. 20, pp. 1310–1311, 2004.
[19] M. Alam, W. Badawy, and G. Jullien, “A new time distributed DCT architecture for MPEG-4 hardware reference model,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 5, pp. 726–730, 2005.
[20] Y. P. Lee, T. H. Chen, L. G. Chen, and C. W. Ku, “A cost-effective architecture for $8\times 8$ two-dimensional DCT/IDCT using direct method,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 1, pp. 459–467, 1997.
[21] Y.-T. Chang and C.-L. Wang, “New systolic array implementation of the 2-D discrete cosine transform and its inverse,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 2, pp. 150–157, 1995.
[22] S.-F. Hsiao and W.-R. Shiue, “A new hardware-efficient algorithm and architecture for computation of 2-D DCTs on a linear array,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 11, pp. 1149–1159, 2001.
[23] S.-F. Hsiao and J.-M. Tseng, “New matrix formulation for two-dimensional DCT/IDCT computation and its distributed-memory VLSI implementation,” IEE Proceedings. Vision, Image and Signal Processing, vol. 149, no. 2, pp. 97–107, 2002.
[24] H. S. Hou, “A fast recursive algorithm for computing the discrete cosine transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 10, pp. 1455–1461, 1987.
[25] S.-F. Hsiao, W.-R. Shiue, and J.-M. Tseng, “Design and implementation of a novel linear-array DCT/IDCT processor with complexity of order $\log_2 N$,” IEE Proceedings. Vision, Image and Signal Processing, vol. 147, no. 5, pp. 400–408, 2000.
[26] Z. Cvetkovic and M. V. Popovic, “New fast recursive algorithms for the computation of discrete cosine and sine transforms,” IEEE Transactions on Signal Processing, vol. 40, no. 8, pp. 2083–2086, 1992.
[27] E. Feig and S. Winograd, “Fast algorithms for the discrete cosine transform,” IEEE Transactions on Signal Processing, vol. 40, no. 9, pp. 2174–2193, 1992.
[28] N. I. Cho and S. U. Lee, “Fast algorithm and implementation of 2-D discrete cosine transform,” IEEE Transactions on Circuits and Systems, vol. 38, no. 3, pp. 297–305, 1991.
[29] I. Koren, Computer Arithmetic Algorithms, chapter 5, A. K. Peters, Natick, Mass, USA, 2nd edition, 2005.
[30] T.-Y. Sung and H.-C. Hsin, “Design and simulation of reusable IP CORDIC core for special-purpose processors,” IET Computers and Digital Techniques, vol. 1, no. 5, pp. 581–589, 2007.
[31] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins Studies in the Mathematical Sciences, chapter 6, Johns Hopkins University Press, Baltimore, Md, USA, 3rd edition, 1996.
[32] Xilinx FPGA products, http://www.xilinx.com/products/.
[33] “TSMC 0.18 μm CMOS Design Libraries and Technical Data, v.5.1,” Taiwan Semiconductor Manufacturing Company (TSMC), Hsinchu, Taiwan, and National Chip Implementation Center (CIC), National Science Council, Hsinchu, Taiwan, 2009.
[34] Cadence design systems, http://www.cadence.com/products/pages/default.aspx.
Copyright
Copyright © 2010 Tze-Yun Sung et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.