Abstract

The paper presents a unified hybrid architecture to compute the integer inverse discrete cosine transform (IDCT) of multiple modern video codecs: AVS, H.264/AVC, VC-1, and HEVC (under development). Based on the symmetric structure of the matrices and the similarity of the matrix operations, we develop a generalized "decompose and share" algorithm to compute the IDCT. The algorithm is then applied to the four video standards. The hardware-sharing approach ensures maximum circuit reuse during the computation. The architecture is designed with only adders and shifters to reduce the hardware cost significantly. The design is implemented on an FPGA and later synthesized in 0.18 μm CMOS technology. The results meet the requirements of advanced video coding applications.

1. Introduction

In recent years, different video applications have used different video standards, such as H.264/AVC [1], VC-1 [2], and AVS [3]. To further improve coding efficiency, the Joint Collaborative Team on Video Coding (JCT-VC) is currently drafting a next-generation video coding standard, tentatively known as High Efficiency Video Coding (HEVC, or H.265) [4]. The target bit rate is half that of H.264/AVC. In addition, several other techniques are proposed in the draft to reduce encoder complexity, such as improved intra-picture coding and simpler VLC coefficients [5]. As a result of these new features, experts predict that HEVC will dominate the future multimedia market.

In order to meet the present and future demands of different multimedia applications, it becomes necessary to develop a unified video decoder that can support all popular video standards on a single platform. In recent years, there has been growing interest in developing multistandard inverse transform architectures for advanced multimedia applications. However, most of them do not support AVS, the video codec developed by the Chinese government that became the core technology of China Mobile Multimedia Broadcasting (CMMB) [6]. None of the existing works supports HEVC; though it is not yet finalized, given its future prospects [7], it is important to start exploring possible hardware implementations of the transform unit discussed in the draft.

In this paper, we present a new generalized algorithm and its hardware implementation as an 8 × 8 IDCT architecture. The scheme is based on matrix decomposition with sparse matrices and offset computations. These sparse matrices are derived so that they can be reused the maximum number of times while decoding the different inverse transform matrices. All multipliers in the design are replaced by adders and shifters. In the scheme, we first split the 8 × 8 transformation matrix into two small 4 × 4 matrices by applying permutation techniques. Then we concurrently perform separate operations on these two matrices to compute the output. This enables parallel operation and yields high throughput, which eventually helps meet the coding requirements of high-resolution video.

The proposed generalized algorithm is later applied to compute the 8 × 8 integer IDCT of AVS. Then we identify the submatrices of AVS and reuse them to compute the IDCT of VC-1. We follow the same principle to compute the other two IDCTs, those of H.264 and HEVC. For HEVC, we have used the draft matrix discussed in the recent meeting [7]; since it is not yet finalized, we have developed the generalized architecture in such a way that it can be easily adjusted to accommodate any changes in the final HEVC format.

2. Previous Works

In recent years, several multistandard inverse transform architectures have been proposed for video applications. Lee's work in [8] presents an 8 × 8 multistandard IDCT architecture based on delta coefficient matrices which can support VC-1, MPEG-4, and H.264. It can process up to 21.9 fps for full HD video. Kim's work in [9] describes a design following a similar approach to [8] to unify the IDCT and inverse quantization (IQ) operations for those three codecs. However, the design cannot support the full HD video format. Qi's work in [10] shows an efficient integrated architecture designed for the multistandard inverse transforms of MPEG-2/4, H.264, and VC-1, using factor share (FS) and adder share (AS) strategies to save circuit resources. The work achieves a 100 MHz working frequency for full HD video resolution but does not support AVS. In another interesting design [11], the authors devise a common architecture by sharing adders and multipliers to perform the transform and quantization of H.264, MPEG-4, and VC-1. The common shortcoming of all the designs discussed in [8–11] is that none of them supports the Chinese standard AVS or HEVC.

In our previous work [12], we developed a resource-shared design using delta coefficient matrices which can compute the 8 × 8 IDCT of VC-1, JPEG, MPEG-4, H.264/AVC, and AVS. However, due to complex data scheduling and the integration of JPEG (which is an image codec), its decoding capability is limited. The design supports both HD formats but fails to handle the super resolution format (WQXGA). Liu [13] introduces another multistandard design, but its throughput is low (110.8 MHz maximum frequency) and it cannot decode HD or WQXGA video. Fan's works in [14, 15] are based on another efficient matrix decomposition algorithm to compute multiple transforms; however, they are limited to only H.264 and VC-1. There are similar works in [16–18], which are also limited to these two codecs (H.264 and VC-1).

In this paper, we present a generalized low-cost algorithm and its single-chip implementation to compute the IDCT of all four modern video standards (AVS, H.264, VC-1, and HEVC). The design meets the requirements of high-performance video coding, as it can process HD video at 145 fps, full HD video at 62 fps, and WQXGA video at 32 fps. The proposed scheme can be applied to both the forward and the inverse transformation; however, here we only show the implementation of the inverse process (targeted at decoders).

3. Proposed Generalized Algorithm for 8 × 8 IDCT

In a video compression system, the transform coding stage usually employs an 8-point type-II DCT. Since the forward DCT uses the same basis coefficients and its matrix is the transpose of the IDCT matrix, the proposed IDCT scheme is easily applicable to it without any added cost or complexity. The 8-point 1-D forward and inverse DCT coefficient matrices are expressed in the general form of (1), which contains seven different transform coefficients:

In this paper, the 8 × 8 IDCT transform matrices for AVS, VC-1, H.264/AVC, and HEVC are each denoted by a separate letter. The seven coefficients of each transform are different, but all are integers (as shown in Table 1).
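For reference, the even/odd symmetric layout commonly used for such an 8-point coefficient matrix, writing the seven coefficients as a through g, is reproduced below; this is an assumption about the arrangement in (1), and the paper's own symbols and row ordering may differ:

    T =
    \begin{bmatrix}
     a &  a &  a &  a &  a &  a &  a &  a \\
     b &  c &  d &  e & -e & -d & -c & -b \\
     f &  g & -g & -f & -f & -g &  g &  f \\
     c & -e & -b & -d &  d &  b &  e & -c \\
     a & -a & -a &  a &  a & -a & -a &  a \\
     d & -b &  e &  c & -c & -e &  b & -d \\
     g & -f &  f & -g & -g &  f & -f &  g \\
     e & -d &  c & -b &  b & -c &  d & -e
    \end{bmatrix},
    \qquad
    T_{\mathrm{IDCT}} = T^{T}.

Each standard in Table 1 instantiates a through g with its own integer values, which is why a single shared datapath with switchable constants can serve all four codecs.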

3.1. Development of a Generalized “Decompose and Share” Algorithm

First of all, we derive a generalized matrix decomposition scheme by utilizing the symmetric structure of the matrices and factoring the 8 × 8 matrix into two 4 × 4 submatrices as shown below: where

The computational complexity of the first factor is only 8 additions. To reduce the complexity of the other, we use permutation techniques by performing the following operations:

Where 

There is no computational cost for the permutation, as it only reorders the input data set (it just needs rewiring). The remaining matrix can be further decomposed into two 4 × 4 submatrices by the direct sum operation (⊕) as shown below: Thus, where

Equation (6) forms the general expression of (1). We will use these two 4 × 4 submatrices as the basic building blocks to compute the other 8 × 8 IDCTs. Since their coefficients are fixed, they can be implemented independently, enabling fast computation.
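As an illustration of the decomposition in (6), the following sketch (in Python/NumPy, using the even/odd layout assumed above and arbitrary placeholder coefficients rather than any standard's actual values from Table 1) computes an 8-point 1-D inverse transform by permuting the input into its even- and odd-indexed coefficients and applying two independent 4 × 4 products, followed by a butterfly recombination:

    import numpy as np

    # Placeholder coefficients, for illustration only (see Table 1 for real values).
    a, b, c, d, e, f, g = 8, 12, 10, 6, 3, 11, 5

    # 4 x 4 submatrix acting on the even-indexed (X0, X2, X4, X6) coefficients
    P_even = np.array([[a,  f,  a,  g],
                       [a,  g, -a, -f],
                       [a, -g, -a,  f],
                       [a, -f,  a, -g]])

    # 4 x 4 submatrix acting on the odd-indexed (X1, X3, X5, X7) coefficients
    P_odd = np.array([[b,  c,  d,  e],
                      [c, -e, -b, -d],
                      [d, -b,  e,  c],
                      [e, -d,  c, -b]])

    def idct8(X):
        """8-point 1-D inverse transform via two concurrent 4 x 4 products."""
        X = np.asarray(X)
        u = P_even @ X[0::2]   # even part (the permutation is free: rewiring)
        v = P_odd @ X[1::2]    # odd part, computed concurrently in hardware
        return np.concatenate([u + v, (u - v)[::-1]])  # butterfly recombination

In hardware, the two 4 × 4 products map onto the two parallel 4-point datapaths of Figure 1, and each constant multiplication inside them is expanded into shifts and additions.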

In the following section, we show how (6) can be applied to the different IDCT matrices. Another new feature of the proposed scheme is that we take advantage of the similarity of the matrix operations to further optimize the implementation. First of all, we apply (6) to efficiently implement the transformation matrix of AVS. Based on it and the generalized structure, we develop the matrix of VC-1 so that we can share as many units (from AVS) as possible. Next, we develop the IDCT matrix of H.264 based on the same principle (decompose and share from AVS and VC-1). At this stage, we achieve the maximum sharing; as shown later (in Section 3.4), the implementation of H.264 does not cost any extra hardware. Finally, we develop the IDCT of HEVC by further decomposing and reusing the units already implemented (with a minimum of additional units).

3.2. Matrix Decomposition for AVS

Let us now construct the AVS matrix (from (1) and Table 1) and apply (6) to compute its two submatrices. We then right shift the matrix by three bits and decompose it as follows: where

As before, the computational cost of the first submatrix is only 4 additions. For the second, we implement the scaled term by right shifting the (arbitrary) data by two bits and then adding; so the cost is 6 additions and 6 shift operations. Thus, in (8), the total computational cost is 10 additions and 6 shift operations. In a similar way, we can decompose the other submatrix as shown below: where

For both submatrices, the common coefficient can be shared, and the cost is 12 additions and 4 shift operations for one and 8 additions and 4 shift operations for the other. From (8)–(10), we can summarize the final expression of the 8 × 8 IDCT for AVS as:

Thus, the total computational cost to implement the AVS IDCT is 38 additions and 26 shift operations. In the next section, we apply (6) to VC-1 and decompose its matrix in a way that lets us reuse the units already developed for AVS (from (12)).
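Because the design uses only adders and shifters, every constant multiplication above is realized as a short sequence of shifts and additions. The following minimal Python sketch illustrates the idea; the constants 9 and 10 are chosen purely as examples of small integer coefficients and are not a claim about any particular standard's values in Table 1:

    def mul9(x):
        # 9*x = 8*x + x : one shift and one addition
        return (x << 3) + x

    def mul10(x):
        # 10*x = 8*x + 2*x : two shifts and one addition
        return (x << 3) + (x << 1)

    def quarter_add(x, y):
        # (x >> 2) + y : the "right shift by two bits and then add" step
        return (x >> 2) + y

    assert mul9(7) == 63 and mul10(-5) == -50

Counting the additions and shifts in such expansions is how the per-submatrix costs quoted in this section are obtained.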

3.3. Matrix Decomposition for VC-1

We follow the same principles as discussed in (8) and (10) to decompose the IDCT for VC-1: where

Now, considering the symmetric property and the similarity of the coefficient distribution patterns with (8), we decompose the corresponding VC-1 submatrix as: where

From (16), (15) can be reexpressed as:

Now it can be seen how the implementation of the AVS matrix (from (12)) can be reused in (17). This matrix decomposition enables hardware sharing and results in significant savings in implementation resources. From (17), the total cost of the two new terms is 8 additions and 6 shift operations.

Next, based on careful observation of the computational similarities with (10), we devise the decomposition scheme of the second VC-1 submatrix as: where

By substituting (19) into (18), the submatrix is expressed as:

Note that the new matrix in (19) is structurally similar to the one in (10) except for a change in the diagonal coefficients, so only that change needs to be implemented; the rest is shared from the existing architecture. We do so by adding 4 multiplexers at the outputs of the four left-diagonal elements of the matrix. Then, according to (19), we reuse this unit to compute the VC-1 submatrix. As the other new matrix can be derived from an existing one by rearranging the rows and changing the polarity of some input bits, we share it as well by adding only 4 multiplexers. Finally, the expressions from (17) and (20) are substituted into (13) to obtain the final expression of the IDCT for VC-1:

It is seen from (21) that only one new unit is required to implement the VC-1 IDCT; the rest is shared from the implementation of AVS (from (12)). So, the total computational cost for VC-1 is 12 additions and 10 shift operations.
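The kind of sharing used here can be illustrated with a small behavioral model. In the sketch below, the 4 × 4 matrices are placeholders (not the actual AVS or VC-1 submatrices): both codecs reuse the same off-diagonal adder network, and a multiplexer selects which diagonal contribution is added at the output, mirroring the four output multiplexers described above:

    import numpy as np

    # Placeholder off-diagonal structure shared by both codecs (illustrative only).
    SHARED = np.array([[0,  3,  1,  2],
                       [3,  0, -2,  1],
                       [1, -2,  0,  3],
                       [2,  1,  3,  0]])

    # Per-codec diagonal coefficients, switched by a multiplexer in hardware.
    DIAG = {"AVS":  np.array([4, 4, 4, 4]),
            "VC-1": np.array([5, 3, 3, 5])}

    def shared_4x4(x, sel):
        """One adder network (SHARED @ x) serves both codecs; only the four
        diagonal terms differ, selected by the codec select signal."""
        x = np.asarray(x)
        return SHARED @ x + DIAG[sel] * x

    y_avs = shared_4x4([1, 2, 3, 4], "AVS")   # same datapath,
    y_vc1 = shared_4x4([1, 2, 3, 4], "VC-1")  # different muxed diagonal

This is why the incremental cost of VC-1 is limited to one new unit plus a handful of multiplexers.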

3.4. Matrix Decomposition for H.264/AVC

Following the procedure illustrated in the two previous sections, we can simplify the 8 × 8 transformation matrix of H.264/AVC as shown below: where

In order to ensure maximum unit sharing, we decompose the first submatrix as below: where

In (24), one term is directly reused from (12). To share the other from the existing architecture, we simply add two multiplexer units. So there is no additional cost, in terms of adders and shifters, to compute this submatrix. Similarly, we can decompose the second submatrix as: where

Here, one term is directly reused from (21) and the other is shared from the existing architecture. In this case we do not even need any additional multiplexers, because they were already added for the sharing described in Section 3.3. The final expression of the 8 × 8 IDCT for H.264 (with all shared units) can be summarized as follows:

It is interesting to note that all terms in (28) are implemented from the terms of (12) and (21); thus, in the proposed scheme, there is no additional cost to implement the IDCT for H.264, which results in significant hardware savings.

3.5. Matrix Decomposition for HEVC

In this section, we develop the transformation matrix for HEVC based on the principles described above. The 8 × 8 matrix can be decomposed as: where

The computational cost of the first term is 4 additions and 4 shift operations. Here, the coefficient is factorized into a shift-and-add form, so the cost of the second term in (30) is 4 additions and 8 shift operations. Similarly, we decompose the other submatrix as: where

Combining (29)–(32), we compute the proposed IDCT for HEVC as given below:

In (34), only the two new matrices need to be implemented; the rest is shared from (12). So the total computational cost to implement the HEVC IDCT in the proposed design is 24 additions and 28 shift operations. It is important to note that the matrix has been carefully decomposed so that, if there is any change in the final standard, all one needs to do is update (30) and (32) with the new parameters without disturbing the rest of the design. In summary, the proposed unified design costs 74 additions (38 + 12 + 0 + 24) and 64 shift operations (26 + 10 + 0 + 28) to perform the inverse transformation of the four supported video standards.

4. Hardware Implementation of the Shared Architecture

In the implementation of the multistandard architecture on a single platform, we have shared the entire hardware units of the 4 × 4 matrices, instead of sharing individual adders, shifters, or other factors (as done in [10]). This ensures the maximum reduction of hardware cost in our design. The overall block diagram of the proposed scheme is shown in Figure 1. We can see from Figure 1 that the first block splits the 8-point computation into two independent 4-point processes; since these two processes work concurrently, the design throughput is greatly increased. The remaining blocks perform different (shared) operations as shown in Table 2.

Figure 2(a) shows the design of the serial-to-parallel converter (S2P) block. It performs a left shift and then stores the inputs one by one into eight registers over 8 clock cycles; at the 9th cycle, all stored input samples are sent to the next block. Here the S2P block effectively functions as a temporary memory buffer, as it stores the rows of the input matrix inside the eight registers. As a result, the proposed design does not require an additional memory architecture. The wrapper architecture is shown in Figure 2(b). In this multicodec system, only one IDCT and its associated computational units are activated at a time by the control unit and the select pin (Sel); the rest are disabled. The other blocks are shown in Figure 3. In different stages of the design, several multiplexers are used to ensure proper computation of the IDCT in operation. Finally, the output block combines two different sets of data and generates one output. In Figure 3, In0, In1, ..., In3 represent the inputs coming from the previous block, and Out0, Out1, ..., Out3 represent the outputs going to the next block. As an example, in Figure 3(c), for one of the shared designs, the inputs come from the preceding subblock and the outputs go to the output block.
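A behavioral sketch of the S2P block is given below in Python. Only the fill-for-eight-cycles-then-forward behavior is taken from the description above; the register width and the exact shift amount are assumptions:

    class S2P:
        """Serial-to-parallel converter model: left-shifts each incoming
        sample, stores it in one of eight registers, and releases all
        eight samples together once the registers are full (the parallel
        transfer happens on the following, i.e. 9th, cycle in hardware)."""

        def __init__(self, shift=1):        # shift amount is an assumption
            self.shift = shift
            self.regs = []

        def clock(self, sample):
            self.regs.append(sample << self.shift)
            if len(self.regs) == 8:          # eight cycles to fill
                row, self.regs = self.regs, []
                return row                   # full row forwarded to the next block
            return None                      # still filling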

The state diagram of the control unit is shown in Figure 4. Here, one signal is the reset and another is a 3-bit internal counter driven by the system clock. There are one reset state and four active states. The states of the control signals are also shown in the diagram; for example, in state 1 (S1), S2P is storing the input vector while the output wrapper block enables its inputs from MUX1 and MUX2. Table 2 shows the units that are active depending on the status of the select pin. For example, the select signal will be "00" when the user wants to perform the IDCT of the AVS codec. In that case, the shared blocks function as the corresponding AVS units (the rest remain inactive, as found in (12)).
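The top-level codec selection can be modeled as a simple dispatch on the 2-bit select signal. In the sketch below, the mapping of "00" to AVS comes from the description above, while the remaining code assignments are assumptions made only for illustration:

    # "00" -> AVS is stated above; the other three codes are assumed here.
    CODEC_BY_SEL = {"00": "AVS", "01": "VC-1", "10": "H.264", "11": "HEVC"}

    def run_idct(sel, row):
        codec = CODEC_BY_SEL[sel]
        # Only the shared blocks needed by this codec would be enabled;
        # all other units stay disabled, as listed in Table 2.
        return codec, row   # placeholder for the actual shared datapath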

5. Performance Analysis and Comparisons

The proposed design is implemented in Verilog, and its operation is verified on a Xilinx Virtex-4 LX60 FPGA. The total number of LUTs needed for the proposed architecture is 2,242. The design is then synthesized using 0.18 μm CMOS technology. The architecture costs 39.3 K gates and 12.15 K standard cells, with a maximum operating frequency of 200.8 MHz. The estimated power consumption is 29.9 mW with a 3 V supply.

In order to demonstrate the sharing efficiency, we have compared the adder count of our design with the 8-point standalone IDCT matrices of three standards, AVS, VC-1, and H.264/AVC (as presented in [12]). The results are shown in Figure 5. As of today, there is no published implementation of the 8 × 8 IDCT of HEVC; thus, we have implemented it separately for the sake of comparison. We can see from Figure 5 that a total of 104 adders is required to implement these four transforms without sharing. The proposed shared design can compute all of them with 28.9% fewer adders. Moreover, the savings achieved for the individual standards due to sharing are also marked in the figure.

It is important to note that, though the proposed design costs 38 adders to implement AVS, it does not cost any additional adder units to implement H.264. Hence, AVS and H.264 combined cost only 38 adders (compared to 48 for standalone implementations). The cost of implementing the shift operations is considered insignificant in this comparison. In Table 3, we compare the cost of the proposed scheme with existing designs available in the literature. None of the designs in this table supports HEVC (which is computationally expensive due to its large matrix parameters, as shown in Table 1). Although the designs in [10, 12] cost fewer adders, it is shown later that the proposed scheme outperforms them in decoding capacity. Considering that the proposed architecture can decode the IDCT of four video codecs, it consumes the fewest adders compared to the others.

In Table 4, we summarize the performance in terms of gate count, maximum working frequency, and standard support, along with other designs. Only the design in [19] has an operating frequency close to ours, but it supports only H.264. Similarly, the designs in [11, 14] support only two codecs and accordingly cost less hardware than ours. Among the other designs, [8–10, 12, 13] are comparable to our design as they support as many as three codecs. While working at maximum capacity, the proposed design can process 200.8 million pixels/sec.

In order to have a better assessment among comparable designs (i.e., those supporting at least three codecs), in Table 5 we compare the decoding capability (using 4:2:0 luma-chroma sampling) of the proposed approach with that of [8–10, 12]. In our work, the maximum achieved frame rate for 1080p video is 200.8 × 10^6 / (1920 × 1080 + 2 × 960 × 540) = 64.56 ≈ 64 fps, which is the highest among all the designs in Table 5. Considering the current trend toward super resolution monitors, in this table we have also compared the decoding capabilities for the Wide Quad eXtended Graphics Array format (WQXGA, with a resolution of 2560 × 1600 pixels). Thus, it can be seen that the proposed design can not only decode AVS, H.264/AVC, VC-1, and HEVC videos, but also maintain a relatively high operating frequency to meet the requirements of real-time transmission (the target frame rates for transmitting HD, full HD, and WQXGA video are 120, 60, and 30 fps, resp.). From the performance analysis, the scheme is found to be competitive as it can transmit the highest number of frames per second and, hence, takes the least time to transmit one frame at a given resolution.
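The decoding-capability figures follow directly from the maximum clock frequency and the 4:2:0 sample count per frame; a quick check in Python, using the resolutions given in the text:

    F_MAX = 200.8e6   # samples processed per second at the maximum frequency

    def fps_420(luma_w, luma_h):
        # 4:2:0 adds two chroma planes at half resolution in each dimension
        samples = luma_w * luma_h + 2 * (luma_w // 2) * (luma_h // 2)
        return F_MAX / samples

    print(int(fps_420(1280, 720)))    # HD (720p)  -> 145 fps
    print(int(fps_420(1920, 1080)))   # full HD    -> 64 fps (64.56)
    print(int(fps_420(2560, 1600)))   # WQXGA      -> 32 fps (32.68)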

6. Conclusion

In this paper, we present a generalized algorithm and a hardware-shared architecture that use the symmetric property of the integer matrices and matrix decomposition to compute the 8-point 1-D IDCT of four modern video codecs: H.264/AVC, VC-1, AVS, and HEVC (still in the draft stage). The architecture is designed in such a way that it can accommodate any change in the final release of HEVC. We first apply the generalized scheme to the AVS transform unit and then gradually build the remaining transform units on top of one another to maximize sharing. The performance analysis shows that the proposed design satisfies the requirements of all four codecs and achieves the highest decoding capability. Overall, the architecture is suitable for low-cost implementation in modern multicodec systems.

Acknowledgment

The authors would like to acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC) for its support of this research.