Abstract

The current trend of digital convergence creates a need for video encoders/decoders (codecs) that support multiple video standards on a single platform, as it is expensive to use a dedicated video codec chip for each standard. This paper presents a high-performance circuit-shared architecture that performs the quantization of five popular video codecs: H.264/AVC, AVS, VC-1, MPEG-2/4, and JPEG. The proposed quantizer architecture is completely division-free, as the division operation is replaced by multiplication, shift, and addition operations for all the standards. The design is implemented on FPGA and later synthesized in 0.18 μm CMOS technology. The results show that the proposed design satisfies the requirements of all five codecs, with a maximum decoding capability of 60 fps at 187 MHz on a Xilinx FPGA platform for 1080p HD video.

1. Introduction

An evident trend in the modern world is digital convergence in consumer electronic products. People want portable devices to offer various functions such as Video on Demand (VOD), Digital Multimedia Broadcasting (DMB), Global Positioning System (GPS) navigation, Portable Multimedia Player (PMP), and so on. Because of this demand, it is necessary to support the widely used video compression standards on a single system-on-chip (SoC) platform. The goal is therefore a multicodec system that achieves high performance at low cost.

Most modern multimedia codecs (both encoder and decoder) employ a transform-quantization pair, as shown in Figure 1. Significant research has been conducted to combine and efficiently implement the transform units for multiple codecs, but little research has focused on the implementation of a multiquantizer unit. A unified Inverse Discrete Cosine Transform (IDCT) architecture supporting five standards (AVS, H.264, VC-1, MPEG-2/4, and JPEG) is presented in [1]. A design supporting the 4 × 4 transform and quantization of H.264 is presented in [2]. The 8 × 8 transform and quantization for H.264 are presented in [3] and [4]. Several other designs based on the H.264 codec are reported in [5–10]. The authors in [11] present a design for AVS quantization. The design in [12] describes an MPEG-2 encoder. In [13], a JPEG encoder is implemented for images, where the quantization block is designed using multiplication and shift operations instead of division. The design in [14] describes a multistandard video decoder supporting four codecs: AVS, H.264, VC-1, and MPEG-2. Silicon Image Inc. currently supplies a Multi-standard High-Definition Video Decoder (MSVD-HD) core that supports the H.264, VC-1, and MPEG-1/2 codecs [15]. Their multiplexed decoder chip costs 970 K gates in TSMC 90 nm technology (including complete memory interfacing, stream reader functionality, and extra logic for context switch support).

However, none of the existing designs can compute the quantization for all of these video codecs. In this paper, we present a new division-free quantization algorithm (DFQA) and its efficient implementation to compute the quantization units for five multimedia codecs: JPEG [16], MPEG-2/4 [17], VC-1 [18], H.264/AVC [19], and AVS [20].

While developing the architecture, we carefully considered all the quantization (Q) coefficients in the Q-tables of the different standards and established a relationship between them. The quantization in MPEG-2/4 and JPEG is defined as the division of the DCT coefficient by the corresponding Q-value specified by the Q-matrices. On the other hand, the two most popular video standards, H.264/AVC and AVS, use multiplication and shift operations for quantization, avoiding the division operation to reduce computational complexity. The quantization in VC-1 is user-defined and similar to the process in MPEG-2 [21]. Based on these observations, we propose a new multiquantizer architecture to support these five codecs. The architecture is implemented on FPGA and synthesized for ASIC, and its cost is compared with existing designs. The design serves as a key unit of a multicodec system in transcoding applications [22, 23].

2. Proposed Division-Free Quantization Algorithm (DFQA)

Quantization (Q) is defined as the division of the DCT coefficient by the corresponding Q-value. In H.264 and AVS, however, it is performed by multiplication and right shift operations, so these standards define their own Multiplication Factors (MFs). The MFs are multiplied by the transform coefficients and the products are then right shifted. In contrast, the quantization in VC-1, MPEG-2/4, and JPEG is defined as a division operation only (using 8 × 8 matrices). As a result, it is a challenge to establish a relationship that is general enough to merge all these schemes. After careful observation, a novel generalized algorithm is developed that is divided into three steps, given as follows.

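Step 1. The transform coefficient is left shifted by a standard-specific number of bits (zero for some standards) and multiplied by the multiplication factor MF.

Step 2. The offset, left shifted by a standard-specific number of bits, is added to the result of Step 1.

Step 3. The result of Step 2 is right shifted by a standard-specific number of bits, which gives the corresponding quantized value (level). The description of the rest of the parameters is listed in Table 1.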

Moreover, QP is the quantization parameter, which determines the quantization step size. In the next sections, we apply the general DFQA to individual codecs.
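The three steps of the DFQA can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the parameter names (`left_shift`, `offset`, `offset_shift`, `right_shift`) are hypothetical stand-ins for the per-standard entries of Table 1, and sign handling of negative coefficients is left out.

```python
def dfqa_quantize(coeff, mf, left_shift=0, offset=0, offset_shift=0, right_shift=8):
    """Quantize one non-negative transform coefficient without division.

    All parameter names are illustrative; the real per-standard values
    come from Table 1 of the paper.
    """
    # Step 1: optionally left shift the coefficient, then multiply by MF.
    product = (coeff << left_shift) * mf
    # Step 2: add the left-shifted offset (offset = 0 makes this a no-op).
    biased = product + (offset << offset_shift)
    # Step 3: the right shift takes the place of the division.
    return biased >> right_shift


# Multiply-and-shift with no offset (the style used for JPEG in Section 2.4).
level = dfqa_quantize(190, 14, right_shift=8)
```

With a zero offset the second step has no effect, which matches the description of the standards for which that step is not applied.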

2.1. DFQA Applied to H.264

In this section, we apply the generalized DFQA to perform the quantization operation in H.264. First, the transform coefficients coming from the transform unit are multiplied directly by MF: the left-shift amount of the first step is 0 for this standard, so no left shift is applied to the transform coefficients. In the second stage, the offset is left shifted and then added to the result of the first stage. This value is finally right shifted in the third stage, which is the final stage of the proposed DFQA; the shift amounts are those listed for H.264 in Table 1. In H.264, as specified in [19], the multiplication factor MF depends on m (= QP mod 6) and the position of the element as follows:

The matrix MF for H.264 is specified as

2.2. DFQA Applied to AVS

Next, we apply the DFQA to perform the quantization operation of AVS. In this case, only the DFQA parameters are changed according to Table 1. For AVS, based on [25], MF depends on QP: each QP specifies one particular MF, and the value of MF for a particular QP is given in Table 2. Once QP is specified, the corresponding MF is multiplied by the transform coefficient in the first step of the DFQA. As in the case of H.264, the transform coefficient is not left shifted for AVS, since the left-shift amount is 0. In the second step, this result is added to the offset value left shifted by 14 bits, and the sum is right shifted by 15 bits in the final step.
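A short sketch of the AVS case may help to show what the second step does. The 14-bit and 15-bit shift amounts are taken from the text above; the offset value of 1 and the MF values used here are assumptions made only for illustration (the actual MF values are those of Table 2). Under that assumption the added term is half of the final divisor 2^15, so the step behaves as rounding to nearest.

```python
OFFSET_SHIFT = 14  # left shift applied to the offset (from the text)
RIGHT_SHIFT = 15   # final right shift (from the text)


def avs_quantize(coeff, mf, offset=1):
    """AVS-style DFQA for a non-negative coefficient.

    offset=1 is an assumed value, not taken from the paper.
    """
    return (coeff * mf + (offset << OFFSET_SHIFT)) >> RIGHT_SHIFT


# Hypothetical MF of 16384 corresponds to scaling by 0.5 (16384 / 2**15).
rounded = avs_quantize(3, 16384)              # 1.5 is rounded up
truncated = avs_quantize(3, 16384, offset=0)  # 1.5 is truncated
```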

2.3. DFQA Applied to VC-1 and MPEG-2/4

VC-1 uses multiple transform sizes, but the same quantization rule is applied to all the coefficients. This standard allows both dead-zone and regular uniform quantization. In uniform quantization, the quantization intervals are identical. A dead-zone is an interval on the number line around zero, such that unquantized coefficients lying in the interval are quantized to zero. All the quantization intervals except the dead-zone are of the same size, the dead-zone being typically larger. The use of a dead-zone leads to substantial bit savings at low bitrates. At a high level, however, the quantization process in VC-1 (scalar quantization, where each transform coefficient is independently quantized and coded) is user defined and, according to [21], similar to the corresponding process in the MPEG-2 standard. In the proposed architecture, the quantization parameters for the VC-1 and MPEG-2/4 standards are therefore the same. As specified by [26], the MPEG-2/4 standard uses two quantization matrices, an intra matrix and a non-intra matrix. Here we focus only on the intra matrix, which is shown in Figure 2.

However, we generate an MF for each coefficient of the quantization matrix specified in Figure 2; the product with this MF is then right shifted by 8 bits. For example, dividing a DCT coefficient coming from the transform unit by a quantization value of 19 can be expressed as a multiplication by an integer MF followed by an 8-bit right shift, as in (6). Hence for 19, the corresponding MF is 14. Moreover, the quantization in VC-1 and MPEG-2/4 requires the denominator of (6) to be multiplied by the quantization step, which for simplicity is fixed in the proposed architecture using the value 5 listed in Table 1, so the right-hand side of (6) takes the form given in (7). The MF values for the quantization matrix in Figure 2 are given in (8). Compared with the elements of the original intra matrix in Figure 2, the elements in (8) are smaller, which helps to decrease the size of the RAM. Once MF is obtained, in the first step it is multiplied by the transform coefficient left shifted by 4 bits, as the left-shift amount is 4 in this case. The second step is not applicable to this standard, so the output of the first step is directly right shifted in the third step.
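The example of the quantization value 19 can be reproduced with a small sketch. The rounding rule is an assumption: rounding 2^8/19 ≈ 13.47 up to the next integer gives the MF of 14 stated above, so a ceiling is used here, but the source does not state the rule explicitly. The function names are hypothetical, and the sketch covers only the plain multiply-and-shift idea, not the additional 4-bit left shift and quantization step of the VC-1/MPEG-2/4 path.

```python
SHIFT_BITS = 8  # fixed 8-bit right shift used for the MF values


def mf_from_q(q_value, shift_bits=SHIFT_BITS):
    """Integer MF approximating 2**shift_bits / q_value (assumed: rounded up)."""
    return ((1 << shift_bits) + q_value - 1) // q_value


def approx_divide(coeff, q_value, shift_bits=SHIFT_BITS):
    """Approximate coeff / q_value using one multiplication and one right shift."""
    return (coeff * mf_from_q(q_value, shift_bits)) >> shift_bits


mf_19 = mf_from_q(19)            # the MF quoted in the text for Q = 19
exact = 190 // 19                # reference integer division
approx = approx_divide(190, 19)  # division-free result
```

Because MF is an integer approximation, the result can deviate from exact division for larger coefficients (for instance, 1000 gives 54 instead of 52), which is the quantization error discussed in Section 2.4.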

2.4. DFQA Applied to JPEG

Similarly, for the JPEG codec, we calculate an MF for each of the coefficients of the quantization matrix. The JPEG standard does not define any fixed quantization matrix; it is the prerogative of the user to select one. Two quantization matrices are provided in Annex K of the JPEG standard for reference. Here we focus only on the luminance quantization matrix shown in Figure 3.

The MF values for the quantization matrix of Figure 3 are given in (9). Again, the elements in (9) are smaller than those of the original luminance matrix in Figure 3, which reduces the size of the RAM. After the MF values are calculated, each of them is multiplied directly by the corresponding transform coefficient in the first step. As for the VC-1 and MPEG-2/4 codecs, the second step is not applied to the JPEG standard. Finally, in the third step of the proposed DFQA, the product from the first step is right shifted by 8 bits, which gives the desired quantized level. To reduce the quantization error in VC-1, MPEG-2/4, and JPEG, we also calculated MF with different amounts of right shift. That approach, however, increases the hardware cost and does not reduce the quantization error significantly. Hence we implement the quantization using MF with a fixed 8-bit right shift.

3. Hardware Implementation of DFQA

The overall architecture of the proposed cost-sharing algorithm is shown in Figure 4(a). It can perform the 8 × 8 quantization of any of the five standards, as selected by the user (or another master system) through the select_standard pin. The proposed architecture contains three main blocks with four-stage pipelining: look-up tables to hold the multiplication factors (MF), one multiquantization unit (composed of only one shared multiplier), and one finite state machine to control all the standards. These blocks and their operations are described below. QP processing, however, is not necessary in the hardware implementation, as it does not process data but calculates parameters used for data processing in the quantization block. Hence we assume that QP processing is done beforehand in software.

The core of the multiquantizer unit, as shown in Figure 4(b), contains one general-purpose multiplier, an adder, and a shared right shifter. The right shift operation depends on the select_standard pin. The look-up tables in Figure 4(c) contain all the MF matrices for the five standards, but only one look-up table is used at a time, based on the specified standard. The multiplexer selects the valid data from the look-up tables. To reduce power consumption, only one look-up table is activated at a time through the enable pin. Once the standard is chosen by the user, the desired look-up table is activated by the controller and all the other look-up tables go into sleep mode. The control logic assigns the enable signal and the MUX selection signals accordingly.

For the H.264 standard, according to Figure 4(c), there are six look-up tables (LUT_H_0, LUT_H_1, LUT_H_2, LUT_H_3, LUT_H_4, and LUT_H_5), one for each value of QP mod 6. If QP is greater than 5, the same look-up tables are still used and only the shift amount changes. For example, with QP = 6 and QP = 7, LUT_H_0 and LUT_H_1 are used, respectively. Similarly, the rest of the look-up tables are reused as the value of QP increases. MUX1 selects the desired MF for the H.264 standard from these six look-up tables based on QP. LUT_AVS, LUT_VC-1/MPEG, and LUT_JPEG contain the MF values for AVS, VC-1/MPEG-2/4, and JPEG, respectively. The detailed block diagram of the entire operation, with all the pipelining boundaries, is shown in Figure 5.
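The reuse of the six H.264 look-up tables can be sketched as a simple index computation. The table names come from Figure 4(c); the function names are hypothetical. The right-shift rule `15 + QP // 6` is the one commonly given for the H.264 forward quantizer and is included as an assumption, since the source does not spell out which parameter changes with QP.

```python
H264_LUTS = ["LUT_H_0", "LUT_H_1", "LUT_H_2", "LUT_H_3", "LUT_H_4", "LUT_H_5"]


def select_h264_lut(qp):
    """Pick the look-up table from m = QP mod 6, as MUX1 would."""
    return H264_LUTS[qp % 6]


def h264_right_shift(qp):
    """Assumed shift rule (not taken from the paper): 15 + floor(QP / 6)."""
    return 15 + qp // 6


# QP = 6 and QP = 7 wrap around to the first two tables.
tables = [select_h264_lut(qp) for qp in (0, 6, 7, 13)]
```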

According to Figure 5, the proposed architecture operates in four pipeline stages and consists of a row-column generator to point to the row and column of the look-up tables, several multiplexers to select the valid path, a multiplier shared by all five standards, a shared adder, and a shared shifter. Quantization is performed by multiplication and right shift operations rather than by division. In the first pipeline stage, the Row-Column-Generator, with the help of the controller, generates the values of row and col that point to the row and column of the look-up table. These row and col values are used to fetch the Multiplication Factor, MF, from the look-up table. In the second stage, the multiplexer selects the valid MF. In stage 3, the multiplier multiplies the transform coefficient coming from the transform unit by the selected MF. In the same stage, the offset value is left shifted by the left shifter, but only for the H.264 and AVS standards; for VC-1, MPEG-2/4, and JPEG, the left shifter in stage 3 is not used. In the fourth and final stage, the outputs of the multiplier and the left shifter are added and the sum is right shifted, and the output register gives the quantized output. For VC-1, MPEG-2/4, and JPEG, however, the output of the multiplier is right shifted directly in this final stage and the output register returns the final quantized level. The left shift and right shift operations for AVS, H.264, VC-1, MPEG-2/4, or JPEG are chosen by the select_standard pin. In addition to the logic shown in Figure 5, there are pipeline registers between consecutive pipeline stages.

The design of the shared multiplier along with the shifter instead of the divider is the key part of the entire process. Although quantization is defined as the division operation, the AVS and H.264 standards define quantization as the multiplication and the right shift operation. To integrate the old standards like MPEG-2/4 and JPEG with AVS and H.264, we propose the whole architecture as a shared multiplication and right shift operation. Due to this strategy, the proposed architecture needs only a shared multiplier rather than both the multiplier (for AVS and H.264) and the divider (for VC-1, MPEG-2/4, and JPEG), which reduces the hardware complexity as well as the cost. Moreover, the whole design shares only one control circuit rather than using specific control circuit for each standard, which makes the design more cost effective.

4. Hardware Comparison

4.1. Performance Comparison in FPGA

The proposed architecture is implemented in Verilog HDL, and its operation is verified on a Xilinx Virtex-4 LX60 FPGA. The design is later synthesized in 0.18 μm CMOS technology. The proposed architecture costs 553 LUTs and 298 slices, with a maximum operating frequency of 187.1 MHz. In Table 3, we compare the performance (FPGA only) of the proposed multiquantizer, which supports five standards, with other designs in terms of hardware count, maximum working frequency, and supported standards. The design in [2] has a lower hardware count than ours but supports only one standard, H.264.

For better evaluation, we have integrated the multitransform design (for five codecs) in [1] with the proposed multiquantization scheme. The combined design is implemented on a Xilinx FPGA (Virtex-4 LX60) and the results are shown in Table 4. This combined architecture costs 1722 LUTs (4-input), 972 slices, and 1036 registers, with a maximum operating frequency of 187 MHz. Note that the operating frequency of the multitransform design in [1] is 194 MHz. In Table 4, we compare the performance (FPGA only) of this combined multi-DCT and multiquantizer with existing designs that include both DCT and quantization blocks. The designs in [6] and [10] support only one standard, H.264, and hence use less hardware than ours. Compared to the existing designs, our design has a higher frequency of operation with a comparable hardware count. Thus, Tables 3 and 4 show that the proposed design supports the highest number of popular and widely used video standards (i.e., AVS, H.264/AVC, VC-1, JPEG, and MPEG-2/4) while still consuming relatively little hardware and running at a higher operating frequency.

4.2. Performance Comparison in VLSI

The proposed design is synthesized in 0.18 μm CMOS technology using Artisan library cells. It consumes 176,911 μm² of silicon area, 19.6 K gates, and 6.8 K standard cells. The frequency of operation is 88.5 MHz. As seen here, the operating frequency of our design is much higher on FPGA because of the use of optimized LUTs that are inherent to the type of FPGA chosen.

In Table 5, we compare the VLSI implementation of the proposed DFQA scheme with the estimated cost of the existing designs. Since, to date, we have not come across any design that supports all five codecs, we show an estimate of the projected cost. The costs of the standalone quantization units are added together to find the total estimated cost of five standalone codecs, which is 31.1 K logic gates. The cost of the proposed circuit-shared quantizer architecture is 19.6 K logic gates, a saving of up to 36.7% relative to the estimated cost of five codecs. Figure 6 illustrates the comparison, showing the percentage of reduction based on Table 5.

4.3. Estimated Area Saving in a Multicodec Design

To better assess the hardware saving for the entire decoder, we have carried out a cost analysis, presented in Figure 7, where the costs of the standalone and shared designs are shown. The cost of a decoder for four codecs (H.264, VC-1, AVS, MPEG-2/4) is taken from [14]. The cost of the JPEG codec is taken from [24] and added to the former to calculate the total cost for all five codecs. Here, MC is motion compensation, IP is intra prediction, VLD is variable length decoder, IQ is inverse quantization, and IT is inverse transform. To calculate the hypothetical cost of the shared implementation, we have used the implementation cost of the MC, IP, and VLD units from the shared design presented in [14]. The cost of IT (for the shared design, taken from our previous work [1]) and the cost of IQ (taken from the current work) are then added. Thus, the shared design (which includes the proposed multicodec DFQA scheme) is estimated to save 41.1% of the overall decoder area compared to the standalone design for five codecs.

In Table 6, we compare the decoding capability of the proposed multiquantization approach with other quantization-only designs. While working at maximum capacity on the Virtex-4 LX60 FPGA, the proposed multiquantizer can achieve a frame rate of 60 fps (with 4:2:0 luma-chroma sampling, 187 × 10^6/(1,920 × 1,080 + 2 × 960 × 540) = 60.1 ≈ 60). The decoding capability of the other designs is also calculated using 4:2:0 sampling. Table 6 shows that the proposed scheme achieves the highest decoding capacity.
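The frame-rate figure can be checked with a few lines of arithmetic. The sketch assumes, as the formula in the text implies, that one coefficient is processed per clock cycle; the function name is hypothetical.

```python
def max_frame_rate(clock_hz, width, height):
    """Frames per second for 4:2:0 video, assuming one sample per clock cycle."""
    luma_samples = width * height
    # 4:2:0 sampling: two chroma planes, each half the width and half the height.
    chroma_samples = 2 * (width // 2) * (height // 2)
    return clock_hz / (luma_samples + chroma_samples)


fps_1080p = max_frame_rate(187e6, 1920, 1080)  # about 60.1 fps
```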

4.4. Performance Evaluation Using Standard Images

To verify functional correctness, in this section we present a performance evaluation of the proposed algorithm using several standard gray-scale images. The images are first coded with the transform (used in [1]) and quantization (presented here) operations, followed by the decoding (inverse) process. The quantization parameter (or quality factor) is set to 10 in all cases. The results are expressed in terms of peak signal-to-noise ratio (PSNR) and presented in Table 7. Figure 8 presents the original and reconstructed images of "Lena" and "Mandrill" for all five codecs.
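For reference, PSNR can be computed as in the following sketch. It uses the standard definition for 8-bit samples (peak value 255), which is assumed here since the source does not give the formula, and it operates on flat lists of pixel values to stay self-contained.

```python
import math


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the original and reconstructed samples.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


# Every pixel off by one gives an MSE of 1.
value = psnr([10, 20, 30, 40], [11, 21, 29, 39])
```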

5. Conclusion

In this paper, we present a high-performance circuit-shared architecture to perform the 8 × 8 quantization operation for five different multimedia codecs. The architecture replaces the hardware-costly division operation with multiplication, addition, and shift operations. In addition, only one control circuit is designed to control the entire architecture for all five standards. These strategies of using a shared multiplier and a shared control circuit result in a much lower hardware cost. The performance analysis shows that the proposed design satisfies the requirements of all codecs and achieves a competitive decoding capability. The scheme is also verified for functional correctness using standard images. Overall, the architecture is suitable for real-time applications in modern multicodec systems.