VLSI Circuits, Systems, and Architectures for Advanced Image and Video Compression Standards
Research Article | Open Access
Maher Jridi, Ayman Alfalou, Pramod Kumar Meher, "Optimized Architecture Using a Novel Subexpression Elimination on Loeffler Algorithm for DCT-Based Image Compression", VLSI Design, vol. 2012, Article ID 209208, 12 pages, 2012. https://doi.org/10.1155/2012/209208
Optimized Architecture Using a Novel Subexpression Elimination on Loeffler Algorithm for DCT-Based Image Compression
Abstract
The canonical signed digit (CSD) representation of constant coefficients is a unique signed-digit representation containing the fewest nonzero bits. Consequently, for constant multipliers, the number of additions and subtractions is minimized by the CSD representation of the constant coefficients. This technique is mainly used for finite impulse response (FIR) filters to reduce the number of partial products. In this paper, we use CSD with a novel common subexpression elimination (CSE) scheme on the optimal Loeffler algorithm for the computation of the discrete cosine transform (DCT). To meet the challenges of low-power and high-speed processing, we present an optimized image compression scheme based on the two-dimensional DCT. Finally, a novel and simple reconfigurable quantization method combined with the DCT computation is presented to effectively reduce the computational complexity. We present a new DCT architecture based on the proposed technique. From the experimental results obtained with an FPGA prototype, we find that the proposed design has several advantages in terms of power reduction, speed performance, and silicon area savings, along with PSNR improvement, over the existing designs as well as the Xilinx core.
1. Introduction
Many applications, such as video surveillance and patient monitoring systems, require many cameras for effective tracking of living and nonliving objects. To manage the huge amount of data generated by several cameras, we proposed an optical implementation of DCT-based image compression in [1], but this solution suffers from poor image quality and high hardware complexity. Following this optical implementation, in this paper we propose a digital realization of an optimized VLSI image compression system. This paper is an extension of our prior work [2–4], with a new compression scheme along with supplementary simulations and an FPGA implementation followed by performance analysis.
Recent video encoders such as H.263 [5] and MPEG-4 Part 2 [6] use DCT-based image compression along with additional algorithms for motion estimation (ME). A simplified block diagram of the encoder is presented in Figure 1. The 2D DCT of each block of the image is performed to decorrelate the input pixels. The DCT coefficients are then quantized, using a quantization matrix, to represent them in a reduced range of values. Finally, the quantized components are scanned in a zigzag order, and the encoder employs run-length encoding (RLE) and Huffman coding/binary arithmetic coding (BAC) based algorithms for entropy coding.
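The zigzag-scan and run-length steps described above can be sketched as follows. This is a generic illustration of the JPEG-style traversal, not the paper's hardware implementation; the function names are ours.

```python
def zigzag_order(n=8):
    """Return the (row, col) visiting order of an n x n zigzag scan.

    Cells are grouped by anti-diagonal d = row + col; odd diagonals are
    walked with increasing row, even diagonals with decreasing row."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def run_length_encode(block):
    """Zigzag-scan a square block and emit (zero_run, value) pairs;
    a trailing (run, 0) pair plays the role of an end-of-block marker."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        v = block[r][c]
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run, 0))
    return pairs
```

After quantization, most high-frequency coefficients are zero, so the zigzag order groups long zero runs that RLE then compresses.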
Since the DCT computation and the quantization process are computation intensive, several algorithms have been proposed in the literature for computing them efficiently in dedicated hardware. Research in this domain can be classified into three parts. The first, and earliest, concerns the reduction of the number of arithmetic operators required for DCT computation [7–13]. The second relates to the computation of the DCT using multiple constant multiplication schemes for hardware implementation [14–25]. Some other works on DCT architectures make use of a convolution formulation; they are efficient but can be used only for prime-length DCTs and are not suitable for video processing applications [26, 27]. Finally, the third part concerns the optimization of the DCT computation in the context of image and video encoding [28–32]. In this paper, we are interested in this last research theme.
In this paper, we propose a novel DCT architecture based on canonical signed digit (CSD) encoding [33, 34]. Hartley in [35] used CSD encoding and common subexpression elimination (CSE) for efficient implementation of FIR filters; however, a direct application of the same CSD and CSE technique to DCT implementation is not efficient. To improve the implementation, we identify multiple subexpression occurrences in the intermediate signals (rather than in the constant coefficients as in [35]) used to compute the DCT outputs. Since each set of identical subexpressions needs to be implemented only once, the resources necessary for these operations can be shared, and the total number of required adders and subtractors can be reduced.
The second contribution of the paper is the introduction of a new image compression scheme in which the second stage of the 1D DCT (the DCT on the columns) is configured for joint optimization of the quantization and the 2D DCT computation. Moreover, trade-offs between image visual quality, power, silicon area, and computing time are analysed.
The remainder of the paper is organized as follows. An overview of fundamental design issues is given in Section 2. The proposed DCT optimization based on CSD and subexpression sharing is described in Section 3. An algorithm based on joint optimization of the quantization and the 2D DCT computation is proposed in Section 4. Finally, the experimental results are detailed in Section 5 before the conclusion.
2. Background
Given an input sequence x(n), n = 0, 1, …, N − 1, the N-point DCT is defined by (1): X(k) = α(k) Σ_{n=0}^{N−1} x(n) cos[(2n + 1)kπ/(2N)], k = 0, 1, …, N − 1, where α(0) = √(1/N) and α(k) = √(2/N) if k ≠ 0.
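The definition above (reconstructed here as the standard orthonormal DCT-II, which is what the Loeffler algorithm computes) can be transcribed directly; this direct O(N²) form is a reference sketch, not the paper's fast algorithm.

```python
import math

def dct_1d(x):
    """Direct N-point DCT-II:
    X(k) = a(k) * sum_n x(n) * cos((2n+1) k pi / (2N)),
    with a(0) = sqrt(1/N) and a(k) = sqrt(2/N) for k > 0."""
    N = len(x)
    X = []
    for k in range(N):
        a = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        X.append(a * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                         for n in range(N)))
    return X
```

A constant input exercises the decorrelation property: all energy lands in the DC coefficient and the AC terms vanish.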
As stated in the introduction, there are two main types of algorithms for DCT computation: one class focuses on reducing the number of required arithmetic operators, while the other is designed for hardware implementation of the DCT. In this section, we provide a brief review of the major developments in both types of algorithms.
2.1. Fast DCT Algorithm
Many fast DCT algorithms have been reported in the literature. All of them use the symmetry of the cosine function to reduce the number of multipliers; a summary of these algorithms is presented in [36]. In Table 1, we list the number of multipliers and adders involved in different DCT algorithms. In [13], the authors show that the theoretical lower bound for the 8-point DCT is 11 multiplications. Since the number of multiplications of Loeffler's algorithm [12] reaches this theoretical limit, our work is based on this algorithm.
Loeffler et al. in [12] proposed to compute the DCT outputs in four stages, as shown in Figure 2. The first stage is performed by 4 adders and 4 subtractors, while the second one is composed of 2 adders, 2 subtractors, and 2 MultAddSub (multiplier, adder, and subtractor) blocks. Each MultAddSub block uses 4 multiplications, which can be reduced to 3 multiplications by constant rearrangement. The fourth stage uses 2 MultSqrt(2) blocks to perform the multiplication by √2.
2.2. Multiplierless DCT Architecture
The DCT given by (1) can be expressed in the inner-product form of (2): X(k) = Σ_{n=0}^{7} c(k, n) x(n), where the c(k, n) are fixed cosine coefficients and the x(n) are the input image pixels.
One possible implementation of the inner product in programmable devices uses embedded multipliers. However, these IPs are not designed for constant multiplication; consequently, they are not power efficient and consume a larger silicon area. Moreover, such a design is not portable for efficient implementation across FPGAs and ASICs. Many multiplierless architectures have therefore been introduced for efficient implementation of the constant multiplications in the inner-product computation. These methods can be classified as the ROM-based design [14], the distributed arithmetic (DA) based design [15], the new distributed arithmetic (NEDA) based design [22], and the CORDIC-based design [23].
2.2.1. ROM MultiplierBased Implementation
This solution is presented in [14] to design a special-purpose VLSI processor for a 2D DCT/IDCT chip usable for high-speed image and video coding. Since the DCT coefficient matrix is fixed, the authors of [14] precompute all possible product values and store them in a ROM rather than computing them with combinational logic. Since the dynamic range of the input pixels is 2^8 for gray-scale images, the number of stored values per coefficient is 256. Each value is encoded using 16 bits; hence, for an 8-point inner product, the ROM size is about 8 × 256 × 16 = 32768 bits, that is, 32.768 kbits. To obtain the 8-point DCT, eight 8-point inner products are required, and consequently the ROM size becomes exorbitant for the realization of image compression.
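The ROM-multiplier idea reduces each constant multiplication to a table lookup; a minimal sketch, assuming an 8-bit unsigned pixel range (the coefficient value 113 below is only an example, not taken from the paper):

```python
def build_product_rom(coeff, pixel_bits=8):
    """Precompute coeff * p for every possible pixel value p
    (2**pixel_bits entries), as in a ROM-based constant multiplier."""
    return [coeff * p for p in range(1 << pixel_bits)]

# One ROM per fixed coefficient; a 'multiplication' is then just a lookup.
rom = build_product_rom(coeff=113)
```

With 256 entries of 16 bits per coefficient and 8 coefficients per inner product, this reproduces the 32.768 kbits figure quoted above.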
2.2.2. Distributed Arithmetic (DA)
Distributed arithmetic (DA) [15] is a well-known technique for computing inner products that outperforms the ROM-based design. The authors of [16–18] use the recursive DCT algorithm to derive designs that require less area than conventional ones. By precomputing all the partial inner products corresponding to all possible bit vectors and storing these values in a ROM, the DA method speeds up the inner-product computation over the multiplier-based method. Unfortunately, in this case also, the size of the ROM grows exponentially with the number of inputs and the internal precision. This is inherent to the DA technique, where a great amount of redundancy is introduced into the ROM to accommodate all possible combinations of bit patterns in the input signal.
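The bit-serial DA principle can be sketched as follows for unsigned inputs (the hardware ROM is modeled as a Python list; this is an illustration of the technique, not the designs of [16–18]):

```python
def da_inner_product(coeffs, xs, bits=8):
    """Bit-serial distributed-arithmetic inner product.

    A ROM of 2**len(coeffs) precomputed partial sums replaces all
    multipliers; the ROM is addressed by one bit plane of the inputs
    per step, and results are shift-accumulated."""
    rom = [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
           for addr in range(1 << len(coeffs))]
    acc = 0
    for b in range(bits):                          # one lookup per bit plane
        addr = sum(((x >> b) & 1) << i for i, x in enumerate(xs))
        acc += rom[addr] << b
    return acc
```

Note the exponential ROM growth the text mentions: 4 inputs need 16 entries, 8 inputs already need 256, for each bit of internal precision kept in the entries.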
2.2.3. New Distributed Arithmetic (NEDA)
The new distributed arithmetic (NEDA) is an adder-based optimization of the DA implementation. The NEDA architecture requires neither ROMs nor multipliers: it provides a reduced-complexity solution by sharing common subexpressions of the input vector to generate an optimal shift-add network for DCT implementation. This results in a low-power, high-throughput DCT architecture. Nevertheless, the implementation of NEDA has two main disadvantages [22]:
(i) the parallel data input leads to a higher scanning rate, which severely limits the operating frequency of the architecture;
(ii) the assumption of serial data input leads to lower hardware utilization.
2.2.4. CORDIC
COordinate Rotation DIgital Computer (CORDIC) provides a low-cost technique for DCT computation. The CORDIC-based DCT algorithm in [23] utilizes dynamic transformation rather than static ROM addressing. The CORDIC method can be employed in two different modes: the rotation mode and the vectoring mode. Sun et al. in [24] have presented an efficient Loeffler DCT architecture based on the CORDIC algorithm. However, the iterative dynamic computation of the cosine function involves long latency and high power consumption and requires a costly scale-compensation circuit.
2.2.5. CSD
Vinod and Lai in [25] have proposed an algorithm that reduces the number of operations by using the CSD and CSE techniques, minimizing the number of switching events in order to reduce the power consumption. In CSD encoding, the constant multiplications are replaced by additions. Hence, there are two types of additions: inter-structural adders, which compute the summation terms of the inner product of (2), and intra-structural adders, which replace the constant multipliers. The authors of [25] applied the CSD encoding to the constant multipliers of the conventional DCT computation. Consequently, the number of intra-structural adders is reduced, but the number of inter-structural adders is increased to 56 for the 8-point DCT, whereas, as reported in Table 1, some fast DCT algorithms use the cosine symmetry to reduce the number of adders to between 26 and 29. Moreover, the authors of [25] used the CSE technique to reduce the number of intra-structural adders: since each data item (image pixel) is multiplied by a distinct constant element (cosine coefficient), Vinod and Lai proposed to reformulate the DCT matrix for efficient substitution of common subexpressions. With this optimization, the total number of intra-structural adders is substantially reduced.
2.3. Joint Optimization
Recent research on DCT implementation uses optimization to adapt the DCT implementation to specific compression standards in order to reduce chip size and power consumption. All these recent works exploit the context in which the DCT is used to reduce the computational complexity: some use the characteristics of the input signals, and others simplify the DCT architecture. Xanthopoulos and Chandrakasan in [28] exploited the signal correlation property to design a DCT core with low power dissipation. Yang and Wang investigated the joint optimization of the Huffman tables, quantization, and DCT [31]: they tried to find the performance limit of the JPEG encoder by proposing an iterative algorithm that finds the optimal DCT bit width for given Huffman tables and quantization step sizes.
A prediction algorithm is developed in [32] by Hsu and Cheng to reduce the computational complexity of the DCT and the quantization process in the H.264 standard. They built a mathematical model based on the offset of the DCT coefficients to develop the prediction algorithm.
In this paper, we propose a model for combined optimization of DCT and quantization to implement them in the same architecture to save the computational complexity for image and video compression.
3. Proposed Algorithm for DCT Computation
In this section, we present a new multiplierless DCT based on CSD encoding.
3.1. Principle of CSD
The CSD representation was first introduced by Avizienis in [33] as a signed-digit representation of numbers. This representation was originally created to eliminate the carry propagation chains in arithmetic operations. It is a unique signed-digit representation containing the fewest nonzero bits, and it is therefore used to implement constant multiplications with the minimum number of additions and subtractions. The CSD representation of a B-bit number x is given by x = Σ_{i=0}^{B−1} s_i 2^i, with s_i ∈ {−1, 0, 1}. CSD numbers have two basic properties:
(i) no two consecutive digits in a CSD number are nonzero;
(ii) the CSD representation of a number contains the minimum possible number of nonzero bits, hence the name canonic.
The CSD values of the constants used in the 8-point Loeffler DCT are listed in Table 2, where the digits 1 and −1 are represented by + and −, respectively. For an 8-bit width, the saving in terms of partial products is about 24%. A general statistical study of the average number of nonzero elements in B-bit CSD numbers is presented in [33], where it is proved that this number tends asymptotically to B/3. Hence, on average, CSD numbers contain about 33% fewer nonzero bits than 2's-complement numbers. Consequently, for multiplications by a constant (where the bit pattern is fixed and known a priori), the number of partial products is reduced by nearly 33% on average.
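The CSD recoding itself is a standard digit-by-digit algorithm and can be sketched as follows; this generic Python version illustrates the encoding and the resulting shift-add multiplication, and is not the paper's VHDL implementation.

```python
def to_csd(value):
    """Canonical signed-digit encoding of a positive integer.

    Returns digits in {-1, 0, 1}, least-significant first, such that
    no two adjacent digits are nonzero."""
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)   # +1 if value ends ...01, -1 if ...11
            value -= d            # cancel the digit; value becomes even
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits

def csd_mult(x, digits):
    """Multiply x by the constant encoded in `digits` using only
    shifts, additions, and subtractions (no multiplier)."""
    return sum((x << i) * d for i, d in enumerate(digits) if d)
```

For example, 23 = 10111 in binary (four nonzero bits) becomes 32 − 8 − 1 in CSD (three nonzero digits), saving one partial product.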

3.2. New CSE Technique for DCT Implementation
To minimize the number of adders, subtractors, and shift operators for DCT computation, we can use the common subexpression elimination (CSE) technique over the CSD representation of the constants. CSE was introduced in [35] and applied to digital filters in transpose form. Contrary to transpose-form FIR filters, the constant coefficients of the DCT (shown in Table 2) multiply 8 different input data, since the DCT transforms an 8-point input sequence into 8 output coefficients. For this reason, we cannot exploit the redundancy among the constants for subexpression elimination as in the case of FIR filters. Moreover, for bit patterns within the same constant, Table 2 shows that only one constant presents a common subexpression, namely +0− repeated once with an opposite sign. Consequently, we cannot use the conventional CSE technique in the same manner as in the case of multiple constant multiplication in FIR filters.
We propose here a new CSE approach for DCT optimization in which we do not consider occurrences within the CSD coefficients but rather the interaction between these codes. Moreover, according to our compression method (detailed in the next section), we use only some of the DCT coefficients (1 to 5 among 8); hence, it is necessary to compute specific outputs separately. To emphasize the advantage of CSE, we take the example of one DCT output. According to Figure 2, this output can be expressed as in (7). Using the CSD encoding of Table 2, (7) is equivalent to (8), and after rearrangement, (8) is equivalent to (9). In the same way, we can determine a second output through (10) and (11). Equations (10) and (11) then give (12), where CS1, CS2, and CS3 denote 3 common subexpressions. The identification of common subexpressions results in a significant reduction of hardware and power consumption. For example, CS2 appears 4 times in (12); this subexpression is implemented only once, and the resources needed to compute CS2 are shared. An illustration of resource sharing is given in Figure 3.
Symbols ≪i denote a left-shift operation by i bit positions. It is important to notice that the non-overbraced terms in (12) are potential common subexpressions which could be shared with other DCT coefficients. According to this analysis, a direct implementation computes this output using 11 adders and 4 embedded multipliers; if CSD encoding is applied, 23 adders/subtractors are required. The proposed method enables computing it using only 16 add/subtract operations. This improvement saves silicon area and reduces power consumption without any decrease in the maximum operating frequency.
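The resource-sharing idea can be illustrated with a toy example. The constants 85 and 39 below are ours, purely for illustration; the paper's actual subexpressions CS1–CS3 operate on Loeffler intermediate signals, not on these values.

```python
def shared_subexpression_outputs(x):
    """Two shift-add constant multiplications that share one
    subexpression: cs = 5x is computed once and reused, so the pair
    costs 3 add/sub operations instead of 4."""
    cs = (x << 2) + x        # common subexpression: 5x
    y1 = (cs << 4) + cs      # 16*5x + 5x = 85x
    y2 = (cs << 3) - x       # 8*5x  -  x = 39x
    return y1, y2
```

In hardware, the shared term corresponds to one physical adder whose output fans out to both coefficient datapaths, which is exactly the saving counted in the 23-to-16 operation reduction above.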
To emphasize the common subexpression sharing, a VHDL model of the calculation of this coefficient was developed using three techniques: embedded multipliers, CSD encoding, and CSE over the CSD encoding. Table 3 shows that the CSD encoding uses more adders, subtractors, and registers than the proposed combined CSD-CSE technique to replace the 4 embedded MULT18x18SIO multipliers. We have also included the equivalent number of LUTs when the design is synthesized on a Xilinx FPGA without any arithmetic DSP core (without MULT18x18). The total number of LUTs is found to be 305, 221, and 200, respectively, for the multiplier-based design, the CSD-based design, and the combined CSD-CSE-based design. Moreover, the time required to obtain the coefficient can be expressed in terms of the addition time and the multiplication time for each of the three designs; combined with the area results, the area-delay product is decreased by subexpression sharing.
 
^{1}Maximum frequency is measured in MHz. 
4. Joint Optimization
4.1. Principle
As discussed earlier, the 2D DCT is computed in two stages by row/column decomposition: a row-wise 1D DCT of the input in stage 1, followed by a column-wise 1D DCT of the intermediate result in stage 2. If we consider an input block of 8 × 8 samples, the row-wise transform calculates the 1D DCT of each input row, and the column-wise transform then applies the 1D DCT to each column of the intermediate result. Hence, for a given block of pixels, we obtain 64 DCT coefficients of different frequencies. Unlike the high-frequency coefficients, the low-frequency coefficients have a great effect on image reconstruction. Moreover, after the quantization process, most of the high-frequency coefficients are likely to be zero, as shown in Figure 4. Since these coefficients are likely to be zero after quantization, we avoid computing them to save computation time and resources. In fact, for an 8 × 8 block of pixels, we compute the 64 1D DCT coefficients of the first stage and then compute only the low-frequency components in the second stage. With this method, the quantization is done on the fly with the DCT algorithm, and consequently we save computational resources. Another advantage of the proposed method is the latency improvement: since the high-frequency coefficients are eventually discarded, the time used for their computation is saved. For the second 1D DCT stage, at least 3 rows do not need to be computed, as illustrated in Figure 5. This gives a saving of at least 12 clock cycles, since the latency of the calculation of each row is 4 clock cycles.
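The row/column decomposition with a partial second stage can be sketched in software as follows. This is a floating-point model for illustration only (the 1-D DCT is re-implemented here for self-containment); the paper's design is a fixed-point shift-add architecture, and we assume the retained zone is a z × z low-frequency square.

```python
import math

def dct_1d(x):
    """Orthonormal N-point DCT-II (direct form, for modeling only)."""
    N = len(x)
    return [(math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)) *
            sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N))
            for k in range(N)]

def dct_2d_zoned(block, zone):
    """Stage 1: 1-D DCT on all 8 rows. Stage 2: 1-D DCT on only the
    first `zone` columns of the intermediate result, keeping only the
    first `zone` outputs of each; everything outside the zone x zone
    square is forced to zero, merging quantization with the transform."""
    inter = [dct_1d(row) for row in block]
    out = [[0.0] * 8 for _ in range(8)]
    for c in range(zone):
        col = dct_1d([inter[r][c] for r in range(8)])
        for r in range(zone):
            out[r][c] = col[r]
    return out
```

With zone = 2 this matches the Zone 1 behavior described below (2 column DCTs, 4 retained coefficients); zone = 5 corresponds to Zone 4 and its 25 coefficients.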
4.2. Quantization Levels
For an 8 × 8 block of pixels, the 1D DCT is calculated for each of the 8 input rows, as shown in Figure 5. For each row, the first 1D DCT coefficient is encoded using 11 bits, which is the estimated word length without truncation or rounding. The remaining coefficients of the row are truncated to 8 bits to trade accuracy for compression ratio and computational complexity.
In the second stage, the 1D DCT is calculated selectively, since the higher-frequency components need not be computed. It is known that the lower frequencies tend to spread across either the first row or the first column of the 2D DCT coefficient matrix; however, the computation of an entire row and an entire column would lead to the computation of all DCT coefficients. For this purpose, we propose an efficient and simple computing scheme that creates 4 DCT zones (shown in Figure 5), where each zone corresponds to a specific compression ratio. For an 8 × 8 block of pixels, the row-wise transform is applied to compute 64 coefficients, while the column-wise transform is applied partially to reduce the computational complexity. The proposed quantization zones are chosen to be square in order to avoid redundancy of computation. In Zone 1, only the 4 lowest-frequency coefficients are calculated, by 2 1D DCT operations: the first is applied to the first column of the intermediate result (the output of the row-wise transform), and the second to the second column, with only the first two outputs of each retained. For this quantization mode, all the other DCT coefficients are set to zero. Similarly, in Zone 4, the 25 coefficients of the 5 × 5 low-frequency square are calculated by 5 1D DCT operations.
The compression ratio depends on the zone selection. In Zone 1, the DC coefficient is encoded using 14 bits. Indeed, the first 1D DCT coefficient is encoded using 11 bits, since in the Loeffler DCT algorithm (shown in Figure 2) the DC output is obtained by three cascaded adder stages applied to 8-bit image pixels. Since this 11-bit DC output is fed to the second stage of 1D DCT, the bit width of the first output of the 2D DCT is equal to 14 bits. This bit width is taken as the reference for encoding the AC coefficients, which have less influence than the DC coefficient on the quality of reconstructed images. To estimate the bit width of the AC coefficients, we referred to the image and video quantization tables (Q) of the JPEG standard for the luminance image component and of the MPEG-4 standard for intraframe video coding, in which each coefficient is divided by a quantization step that grows with frequency. Then, in order to have a unique bit width per quantization zone, the AC coefficients of Zone 1 are encoded with 5 bits fewer than the DC bit width (i.e., 9 bits). Likewise, Zone 2 comprises the coefficients of Zone 1 along with 5 other coefficients, all encoded using 8 bits. The additional coefficients of Zone 3 are encoded using 7 bits. Finally, the remaining coefficients of Zone 4 are encoded using 6 bits, corresponding to the largest quantization steps.
Hence, by selecting Zone 1, the total number of bits is equal to 14 + 3 × 9 = 41 bits, and the compression ratio (CR), defined as the ratio of the original block size (8 × 8 × 8 = 512 bits) to the compressed block size, is equal to 512/41 ≈ 12.49. For Zone 2, the compression ratio is equal to 512/81 ≈ 6.32. Similarly, for Zones 3 and 4, the CRs are, respectively, equal to 3.95 and 2.78.
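The zone bit budgets above can be checked mechanically. The sketch below assumes, per the description, that Zone k keeps a (k+1) × (k+1) square of coefficients and that the AC coefficients added by Zones 1 to 4 use 9, 8, 7, and 6 bits respectively; the computed Zone 4 ratio matches the reported 2.78 (Zone 3 comes out at 3.94, rounding slightly below the reported 3.95).

```python
def compression_ratio(zone):
    """Bits per 8x8 block and resulting CR for quantization zones 1-4."""
    widths = [9, 8, 7, 6]        # AC bit width added by each zone
    bits, kept = 14, 1           # 14-bit DC coefficient, counted once
    for z in range(zone):
        side = z + 2             # zone z+1 keeps a side x side square
        added = side * side - kept
        bits += added * widths[z]
        kept = side * side
    return (64 * 8) / bits       # raw 8-bit block vs. coded bits
```

This also reproduces the bpp range quoted in Section 5.2: 41/64 ≈ 0.64 bpp for Zone 1 up to 184/64 ≈ 2.87 bpp for Zone 4.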
4.3. DCT Calculation
An example of the calculation of a first-stage 1D DCT coefficient is given in Section 3.2. To compute the coefficients of the second-stage 1D DCT, we use the same method, replacing the inputs by the intermediate results of the first stage and the outputs by the 2D DCT coefficients. According to the algorithm illustrated in Figure 2, for a given column, some outputs are calculated using only adders and subtractors, while another uses a multiplicative constant; the intermediate results involved are shown in Figure 2 for a given column. The constants concerned are converted to CSD format, and the corresponding calculations are given in Figure 6.
For the remaining outputs, the CSD-CSE techniques are also used; the common subexpressions are chosen so as to increase the number of subexpressions shared between outputs. According to Figure 2, the next output can be expanded, encoded in CSD, and rearranged in the same way as in Section 3.2. For its calculation, the common subexpressions CS1 and CS3 defined earlier are reused, and two new subexpressions, CS4 and CS5, are introduced to further reduce the number of arithmetic operators. It is important to mention that the equations listed before are expressed so as to create several occurrences of common subexpressions such as CS1, CS3, and CS4. The signal flow graphs of these computations are shown in Figure 7.
5. Simulation Results
We have coded the proposed method and the existing competing algorithms in VHDL and synthesized them using the Xilinx ISE tool.
5.1. Synthesis Results
From high-level synthesis results, we obtain the number of adders used by different DA-based 1D DCT designs, listed in Table 4. It is found that our design uses fewer adders than the others. The direct realization of the DA-based DCT design requires 308 adders; the optimizations presented in [19] reduce the number of adders to 85. Regarding the CSD-based design [25], for an 8-bit constant width, we find that the design of [25] consumes 123 adders (67 intra-structural adders + 56 inter-structural adders), while the proposed design implements the DCT with 72 adders. We have listed the number of slices occupied by the 1D DCT of [37] and by the proposed design in Table 5. The proposed method compares favorably with the conventional multiplierless architectures on the Xilinx XC2VP50 FPGA, the same device employed in [37].
Note that the Xilinx core uses Chen's algorithm [7] and requires 369 slices along with 4 embedded MULT18x18SIO multipliers. The number of slices required by the Xilinx core is relatively low compared with the other designs because the adder/subtractor module of the Xilinx design alternately performs addition and subtraction by using a toggle flop. However, the slice-delay product of the proposed design is significantly less than that of the Xilinx DCT IP core, since the latter has a maximum usable frequency (MUF) of 101 MHz on a Spartan-3E device while the proposed design provides an MUF of 119 MHz.
Regarding the timing analysis derived from synthesis results obtained with a Cadence 0.18 μm standard-cell library, we find that the proposed design involves a delay of about 14.4 ns, which is nearly 15% less than the Xilinx core and 60% less than the optimized NEDA-based design [20]. We should underline that [20] presents an optimized architecture of the NEDA-based design of [19] in which compressor trees are used to decrease the delay.
Moreover, we have used the XPower tool of the Xilinx ISE suite to estimate the dynamic power consumption. The power dissipation of the proposed 1D DCT design and of the Xilinx core is about 39 mW and 62 mW, respectively.
In order to highlight the effect of subexpression sharing, the 1D DCT structure of the Loeffler algorithm was implemented with different multiplier designs. The power-delay product in nJ is computed as the product of the DCT computation time (ns) and the power dissipation (W). The power-delay product for different numbers of computed DCT coefficients is plotted in Figure 8. The CSD-based design and the proposed CSE-based design involve nearly 43% and 33%, respectively, of the power-delay product of the Xilinx multiplier-based design. It can be seen in Figure 8 that the computation of only the first 1D DCT coefficient involves the same power-delay product for all designs, since this coefficient does not require any multiplier. Note also that the computation of the 4th DCT coefficient requires nearly the same power-delay product as that of the 5th, since the fifth DCT coefficient requires only one more subtractor.
For the 2D DCT architecture using the Loeffler algorithm, a performance analysis is presented in Table 6 in order to highlight the effects of CSD coding, subexpression sharing, and quantization. The multiplier-based structures considered in the comparison are the Xilinx embedded multiplier synthesized as a multiplier block IP and as LUTs. The input bit width is 8 bits, the DC coefficient bit widths of the first and second 1D DCT stages are 11 and 14 bits, respectively, and the constant cosine coefficient bit width is 8 bits. The 2D DCT is implemented by decomposing it into two 1D DCT computations together with a transpose memory. It can be observed in Table 6 that the area-delay complexity of the Xilinx multiplier-based 2D DCT design (synthesized as a multiplier block) is nearly the same as that of the combined CSD-CSE design but has nearly twice the power consumption. On the other hand, when the Xilinx multipliers are synthesized as LUTs, the 2D DCT structure has a lower power-delay product but involves twice the area compared with the combined CSD-CSE structure.

The average computation time (ACT) is the time interval after which we get a complete set of 2D DCT coefficients; it is the product of the number of clock cycles required for the 2D DCT computation and the duration of a clock cycle. The 2D DCT computation requires 86 cycles, comprising 8 cycles to register the inputs, 7 cycles for the first-stage 1D DCT, 64 cycles for the transpose memory, and 7 cycles for the second-stage 1D DCT.
Finally, we use the energy per output coefficient (EoC) as a power metric; it is the average energy required to compute one value of the 2D DCT output and is calculated by multiplying the ACT by the power consumption and dividing the product by 64. Table 6 shows that the design based on the Xilinx multiplier IP involves more than twice the EoC of all the other designs. Moreover, the proposed 2D DCT with the CSD-CSE technique needs less energy than the CSD-based design and the Xilinx multiplier-based design. It can be further observed that the proposed CSD-CSE and quantization technique has 46% to 65% of the EoC of the CSD-based design.
5.2. FPGA Implementation of Image Compression
In this subsection, we examine the quality of the reconstructed images using an FPGA prototype of the proposed DCT-based image compression unit. The test images are stored in a ROM in order to avoid the transmission time between the PC and the FPGA. Note that, in the final design, a 2-bit word is used to indicate the 4 available compression ratios. To measure the visual quality of the reconstructed images and to validate the proposed DCT design, we use the Xilinx integrated logic analyzer (ILA). This module works as a digital oscilloscope and enables triggering on signals in the hardware design.
To reconstruct the images, a floating-point inverse 2D DCT function in Matlab is applied to the FPGA output. The PSNRs of different gray-scale images are evaluated and listed in Table 7. The bit rate per pixel (bpp) depends on the quantization zone selection and varies from 0.64 to 2.87. (The compression is due to the DCT only; to increase the compression ratio further, the quantized DCT output would need to pass through the entropy coding, which we have not performed here.) As shown in Table 7, good or acceptable visual quality can be obtained by joint optimization of the quantization and the DCT. Moreover, to underline the agreement between the PSNR results and user perception, Figure 9 shows the decoded images for different selections of quantization zones. As expected, the higher the PSNR of the reconstructed image, the better the perceived quality.
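The PSNR figure of merit used in Table 7 is the standard definition for 8-bit images; a minimal reference computation (ours, not the paper's Matlab script) is:

```python
import math

def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two same-sized
    gray-scale images given as flat sequences of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return float('inf')          # identical images
    return 10 * math.log10(peak * peak / mse)
```

Higher PSNR means the reconstruction is closer to the original, which is the trend Figure 9 confirms visually across the four quantization zones.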

6. Conclusion
In this paper, we have presented a low-complexity DCT-based image compression scheme. We introduced a novel common subexpression sharing of the intermediate signals of the DCT computation, based on the CSD representation of Loeffler's 8-point DCT algorithm. Finally, we combined the quantization process with the second stage of the DCT computation in order to optimize the bit width of the DCT coefficient computation according to the quantization of the different zones.
We would like to point out that a prior detection of zero-quantized coefficients, combined with the proposed techniques, could further reduce the complexity of the DCT computation.
References
[1] A. Alkholidi, A. Alfalou, and H. Hamam, “A new approach for optical colored image compression using the JPEG standards,” Signal Processing, vol. 87, no. 4, pp. 569–583, 2007.
[2] M. Jridi and A. Alfalou, “Joint optimization of low-power DCT architecture and efficient quantization technique for embedded image compression,” in VLSI-SoC: Forward-Looking Trends in IC and System Design, J. Ayala, A. Alonso, and R. Reis, Eds., pp. 155–181, Springer, Berlin, Germany, 2012.
[3] M. Jridi and A. Alfalou, “A low-power, high-speed DCT architecture for image compression: principle and implementation,” in 18th IEEE/IFIP International Conference on VLSI and System-on-Chip (VLSI-SoC '10), pp. 304–309, September 2010.
[4] M. Jridi and A. Alfalou, “A VLSI implementation of a new simultaneous images compression and encryption method,” in IEEE International Conference on Imaging Systems and Techniques (IST '10), pp. 75–79, July 2010.
[5] “Video coding for low bit rate communication,” ITU-T Rec. H.263, February 1998.
[6] ISO/IEC DIS 10918-1, “Coding of audio visual objects: part 2, visual,” ISO/IEC 14496-2 (MPEG-4 Part 2), January 1999.
[7] W. H. Chen, C. H. Smith, and S. C. Fralick, “A fast computational algorithm for the discrete cosine transform,” IEEE Transactions on Communications, vol. 25, no. 9, pp. 1004–1009, 1977.
[8] B. G. Lee, “A new algorithm to compute the discrete cosine transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1243–1245, 1984.
[9] M. Vetterli and H. J. Nussbaumer, “Simple FFT and DCT algorithms with reduced number of operations,” Signal Processing, vol. 6, no. 4, pp. 267–278, 1984.
[10] N. Suehiro and M. Hatori, “Fast algorithms for DFT and other sinusoidal transforms,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 642–664, 1986.
[11] H. Hou, “A fast recursive algorithm for computing the discrete cosine transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, pp. 1455–1461, 1987.
[12] C. Loeffler, A. Ligtenberg, and G. S. Moschytz, “Practical fast 1-D DCT algorithm with 11 multiplications,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP '89), pp. 988–991, May 1989.
[13] P. Duhamel and H. H’mida, “New 2^n DCT algorithm suitable for VLSI implementation,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP '87), pp. 1805–1808, November 1987.
[14] D. Slawecki and W. Li, “DCT/IDCT processor design for high data rate image coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 2, no. 2, pp. 135–146, 1992.
[15] S. A. White, “Applications of distributed arithmetic to digital signal processing: a tutorial review,” IEEE ASSP Magazine, vol. 6, no. 3, pp. 4–19, 1989.
[16] A. Madisetti and A. N. Willson, “100 MHz 2D 8 × 8 DCT/IDCT processor for HDTV applications,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 2, pp. 158–165, 1995.
[17] S. Yu and E. E. Swartzlander, “DCT implementation with distributed arithmetic,” IEEE Transactions on Computers, vol. 50, no. 9, pp. 985–991, 2001.
[18] D. W. Kim, T. W. Kwon, J. M. Seo et al., “A compatible DCT/IDCT architecture using hardwired distributed arithmetic,” in IEEE International Symposium on Circuits and Systems (ISCAS '01), pp. 457–460, May 2001.
[19] A. Shams, W. Pan, A. Chidanandan, and M. Bayoumi, “A low power high performance distributed DCT architecture,” in IEEE Computer Society Annual Symposium on VLSI (ISVLSI '02), pp. 21–27, 2002.
[20] A. Chidanandan, J. Moder, and M. Bayoumi, “Implementation of NEDA-based DCT architecture using even-odd decomposition of the 8 × 8 DCT matrix,” in 49th Midwest Symposium on Circuits and Systems (MWSCAS '06), pp. 600–603, August 2007.
[21] P. K. Meher, “Unified systolic-like architecture for DCT and DST using distributed arithmetic,” IEEE Transactions on Circuits and Systems I, vol. 53, no. 12, pp. 2656–2663, 2006.
[22] M. Alam, W. Badawy, and G. Jullien, “A new time distributed DCT architecture for MPEG-4 hardware reference model,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 5, pp. 726–730, 2005.
[23] S. Yu and E. E. Swartzlander, “A scaled DCT architecture with the CORDIC algorithm,” IEEE Transactions on Signal Processing, vol. 50, no. 1, pp. 160–167, 2002.
[24] C. C. Sun, S. J. Ruan, B. Heyne, and J. Goetze, “Low-power and high-quality CORDIC-based Loeffler DCT for signal processing,” IET Circuits, Devices and Systems, vol. 1, no. 6, pp. 453–461, 2007.
[25] A. P. Vinod and E. M. K. Lai, “Hardware efficient DCT implementation for portable multimedia terminals using subexpression sharing,” in IEEE Region 10 Annual International Conference (TENCON '04), pp. A227–A230, November 2004.
[26] C. Cheng and K. K. Parhi, “A novel systolic array structure for DCT,” IEEE Transactions on Circuits and Systems II, vol. 52, no. 7, pp. 366–369, 2005.
[27] P. K. Meher, “Systolic designs for DCT using a low-complexity concurrent convolutional formulation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 9, pp. 1041–1050, 2006.
[28] T. Xanthopoulos and A. P. Chandrakasan, “A low-power DCT core using adaptive bitwidth and arithmetic activity exploiting signal correlations and quantization,” IEEE Journal of Solid-State Circuits, vol. 35, no. 5, pp. 740–750, 2000.
[29] J. Huang and J. Lee, “A self-reconfigurable platform for scalable DCT computation using compressed partial bitstreams and block RAM prefetching,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 11, pp. 1623–1632, 2009.
[30] J. Huang and J. Lee, “Efficient VLSI architecture for video transcoding,” IEEE Transactions on Consumer Electronics, vol. 55, no. 3, pp. 1462–1470, 2009.
[31] E. H. Yang and L. Wang, “Joint optimization of run-length coding, Huffman coding, and quantization table with complete baseline JPEG decoder compatibility,” IEEE Transactions on Image Processing, vol. 18, no. 1, pp. 63–74, 2009.
[32] C. L. Hsu and C. H. Cheng, “Reduction of discrete cosine transform/quantisation/inverse quantisation/inverse discrete cosine transform computational complexity in H.264 video encoding by using an efficient prediction algorithm,” IET Image Processing, vol. 3, no. 4, pp. 177–187, 2009.
[33] A. Avizienis, “Signed-digit number representations for fast parallel arithmetic,” IRE Transactions on Electronic Computers, vol. 10, pp. 389–400, 1961.
[34] R. H. Seegal, “The canonical signed digit code structure for FIR filters,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 5, pp. 590–592, 1980.
[35] R. T. Hartley, “Subexpression sharing in filters using canonic signed digit multipliers,” IEEE Transactions on Circuits and Systems II, vol. 43, no. 10, pp. 677–688, 1996.
[36] C. Y. Pai, W. E. Lynch, and A. J. Al-Khalili, “Low-power data-dependent 8 × 8 DCT/IDCT for video compression,” IEE Proceedings: Vision, Image and Signal Processing, vol. 150, no. 4, pp. 245–255, 2003.
[37] B. I. Kim and S. G. Ziavras, “Low-power multiplierless DCT for image/video coders,” in 13th International Symposium on Consumer Electronics (ISCE '09), pp. 133–136, May 2009.
Copyright
Copyright © 2012 Maher Jridi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.