Mathematical and Computational Topics in Design Studies (Special Issue)
A Systematic Hardware Sharing Method for Unified Architecture Design of H.264 Transforms
Multitransform techniques have been widely used in modern video coding and have better compression efficiency than the conventional single transform technique. However, every transform needs a corresponding hardware implementation, which results in a high hardware cost for multiple transforms. A novel method that includes a five-step operation sharing synthesis technique and an architecture-unification technique is proposed to systematically share the hardware and reduce the cost of multitransform coding. In order to demonstrate the effectiveness of the method, a unified architecture is designed using the method for all of the six transforms involved in the H.264 video codec: 2D 4 × 4 forward and inverse integer transforms, 2D 4 × 4 and 2 × 2 Hadamard transforms, and 1D 8 × 8 forward and inverse integer transforms. Firstly, the six H.264 transform architectures are designed at a low cost using the proposed five-step operation sharing synthesis technique. Secondly, the proposed architecture-unification technique further unifies these six transform architectures into a low cost hardware-unified architecture. The unified architecture requires only 28 adders, 16 subtractors, 40 shifters, and a proposed mux-based routing network, and the gate count is only 16308. The unified architecture processes 8 pixels/clock-cycle at up to 275 MHz, which is equal to 707 Full-HD 1080p frames/second.
1. Introduction
Video coding standards commonly use transform coding techniques. Discrete cosine transforms (DCTs) are widely used in image and video compression standards, such as JPEG [1], MPEG-1/2 [2, 3], and MPEG-4 [4]. Unlike the DCTs used in previous standards, H.264 [5] uses integer transform matrices for coding, so there is no mismatch between the forward and inverse transforms [6, 7] and the complexity is significantly less than that for a DCT. H.264 also provides a specific transform for each prediction mode, and blocks of size 16 × 16 down to 4 × 4 pixels can be used for motion prediction. The prediction modes are organized in a tree-structured manner, which allows flexible combination of different motion compensation block sizes inside a 16 × 16-pixel macroblock. Therefore, H.264 achieves better compression, but it also requires an enormous number of computations.
For the luma component, H.264 requires three transforms: 8 × 8 and 4 × 4 integer transforms and a 4 × 4 Hadamard transform. For the chroma components, it requires two transforms: a 4 × 4 integer transform and a 2 × 2 Hadamard transform. Every transform of H.264 needs a corresponding hardware implementation, which results in a high hardware cost for all of the transforms involved. In recent years, some transform architectures for H.264 encoders/decoders that reduce the hardware cost have been proposed. In general, for low cost multiple transforms of video codecs, hardware sharing is the most suitable technique [8–26]. In [8–13], all of the 4 × 4 transforms are realized as a shared architecture. In [14], 2D 4 × 4 and 2 × 2 transforms are implemented in an FPGA and integrated in a multicore embedded system. In [15, 16], pipeline shared architectures for the 8 × 8 forward/inverse integer transform are demonstrated. For decoder use only, [17, 18] demonstrate transform processors for 8 × 8 and 4 × 4 inverse integer transforms and a 4 × 4 Hadamard transform. A 2 × 2 Hadamard transform has even been embedded into a shared architecture in [19, 20]. In [21–23], inverse transforms for a multistandard decoder are proposed. In [24], a cost-sharing architecture for an 8 × 8 integer cosine transform is proposed, which supports multiple video encoders. In [25], a unique kernel for multistandard video encoder transforms is presented and a configurable butterfly array (CBA) is also proposed, which supports both the forward transform and the inverse transform in the unified architecture of the multistandard video encoder in [26].
Although these studies demonstrate combined architectures for multiple transforms, no single architecture has been designed for the whole set of forward and inverse transforms of an H.264 encoder and decoder. Therefore, this study designs a unified architecture for the complete transform functionality of an H.264 codec, while still maintaining low cost and high speed. In addition, although these studies share hardware among identical operations to reduce the hardware cost, no systematic hardware sharing method has been proposed in the literature. This paper proposes a novel method that includes a five-step operation sharing synthesis (FOSS) technique and an architecture-unification technique, to systematically share the hardware and reduce the cost of multitransform coding.
In order to design the low cost unified architecture, a hardware sharing method is proposed, as shown in Figure 1. Firstly, six transform architectures are designed for low cost, using the proposed five-step operation sharing synthesis (FOSS) technique. Secondly, these low cost FOSS architectures are merged into a single architecture, using the proposed architecture-unification technique. The details of these techniques are described in the later sections. Section 2 describes the FOSS architecture design that reduces the hardware cost for each H.264 transform. Section 3 demonstrates the unification of all the low cost transform FOSS architectures into a single architecture, to eliminate the redundant hardware. The complexity and performance of the unified architecture are analyzed in Section 4, and Section 5 concludes the paper.
2. The Five-Step Operation Sharing Synthesis Technique
This section first describes a five-step operation sharing synthesis (FOSS) technique and demonstrates its effectiveness by building low cost architectures for the 1D 8 × 8 inverse integer transform and the 2D 4 × 4 inverse integer transform. The procedure for the FOSS technique is as follows.
Step 1. Put rows with the same coefficients into the same group.
Step 2. Determine the same operations in each group.
Step 3. Determine the same operations between groups.
Step 4. Replace multiplication by addition and shift.
Step 5. Map each shared item to an operation node.
The basic idea is to systematically synthesize an architecture that shares all of the same operations in a matrix multiplication to reduce the cost of hardware.
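The five steps can be illustrated on a small case. The sketch below is not the paper's 8 × 8 example; it assumes the well-known 4-point H.264 forward core matrix as a stand-in, and the names `CF`, `direct_transform`, and `shared_transform` are hypothetical. It shows how grouping rows, reusing common sums and differences, and replacing multiplications by shifts give the same result as the plain matrix product.

```python
# Illustrative 4-point kernel (assumed here: the H.264 4x4 forward core matrix).
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def direct_transform(x):
    """Plain matrix-vector product: 16 multiplications and 12 additions."""
    return [sum(c * v for c, v in zip(row, x)) for row in CF]


def shared_transform(x):
    """Same result after operation sharing: 8 add/sub nodes and 2 shifts."""
    x0, x1, x2, x3 = x
    # Steps 1-3: rows {0, 2} and rows {1, 3} use the same coefficient
    # magnitudes, and all four rows reuse the same sums and differences.
    s0 = x0 + x3
    s1 = x1 + x2
    d0 = x0 - x3
    d1 = x1 - x2
    # Step 4: multiplication by 2 becomes a left shift by one bit.
    # Step 5: each remaining expression maps to one add or subtract node.
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]
```

In this small case the shared form needs two stages of nodes, which is the kind of staged structure that Step 5 maps to hardware.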
2.1. FOSS Architecture for a 1D 8 × 8 Inverse Integer Transform
Taking the 1D 8 × 8 inverse integer transform as an example, a low cost architecture, called the FOSS architecture, is designed using the proposed FOSS technique. The inverse integer transform is defined as

$$Y = C_i^{T} X C_i, \tag{1}$$

where $C_i$ is the 8 × 8 inverse integer transform matrix and $X$ is an 8 × 8 pixel block. $C_i^{T}$ is the transpose matrix of $C_i$ and $Y$ is the 8 × 8 matrix output of the 2D inverse integer transform. Since the 2D transform is separable, by using the column-row decomposition method, the computation can be converted to a 1D row transform followed by a 1D column transform. These operations can be represented as follows:

$$Z = X C_i, \tag{2}$$

$$Y = C_i^{T} Z. \tag{3}$$

Therefore, a 2D transform can be calculated using two steps. The first step is the 1D transform $Z = X C_i$ and the second step is the other 1D transform $Y = C_i^{T} Z$. The first row of $X$, $x_0$~$x_7$, multiplied by the inverse transform matrix equals the first row of $Z$, $z_0$~$z_7$, as shown in the following:

$$[z_0\ z_1\ \cdots\ z_7] = [x_0\ x_1\ \cdots\ x_7]\, C_i. \tag{4}$$
The matrix multiplication in (4) requires 64 multiplications and 56 additions. This paper proposes a novel operation sharing synthesis technique to reduce the hardware cost, and the effectiveness of this technique is demonstrated by applying this technique to a 1D inverse integer transform.
Step 1. Put rows with the same coefficients into the same group, to determine the same operations in the next step more easily:
Step 2. Determine the same operations in each group and mark the shared operations using “(),” as shown in the following:
Step 3. Determine the same operations between groups and mark the shared operations using “(),” as shown in the following:
Step 4. Replace multiplication by addition and shift. If the coefficient is a power of two, a shift replaces the multiplication. For the other coefficients, addition, subtraction, and shift are all needed to replace the multiplication. The shared operations are indicated using “(),” as shown in the following:
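Step 4 can be sketched in software as follows. This is only an illustration of the general idea of replacing a constant multiplication by shifts, additions, and subtractions; it uses a signed-digit (non-adjacent form) decomposition, which is an assumption and not necessarily the decomposition chosen in the paper. The function names are hypothetical.

```python
def signed_digits(coeff):
    """Return (shift, sign) terms with sum(sign * 2**shift) == coeff, coeff > 0.

    Uses the non-adjacent form, so no two neighbouring bit positions are used.
    """
    terms = []
    shift = 0
    n = coeff
    while n > 0:
        if n & 1:
            digit = 2 - (n % 4)  # +1 if n % 4 == 1, -1 if n % 4 == 3
            terms.append((shift, digit))
            n -= digit
        n >>= 1
        shift += 1
    return terms


def multiply_by_constant(x, coeff):
    """Multiply x by coeff using only shifts, additions, and subtractions."""
    total = 0
    for shift, sign in signed_digits(coeff):
        total += sign * (x << shift)
    return total
```

A power-of-two coefficient such as 8 yields a single shift term, while a coefficient such as 7 becomes one shift and one subtraction (8x − x), which matches the distinction drawn in Step 4.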
The computational complexity for a 1D inverse transform is reduced from 64 multiplications and 56 additions in (4) to 24 additions, 16 subtractions, and 18 shift operations in (8), which can be directly mapped to a low cost hardware architecture.
Step 5. Map each shared item to an operation node. The inputs and outputs of the architecture are the inputs and outputs of the transform, and the shared items become the intermediate nodes. Only four stages are required for the architecture, from input to output, and every node belongs to one of these stages. The operations for each stage of the 1D inverse integer transform are summarized as follows.
Stage 1. Consider
Stage 2. Consider
Stage 3. Consider
Stage 4. Consider
Using these operations for the four stages, the FOSS architecture for a 1D inverse integer transform is obtained as shown in Figure 2. In Figure 2, a node with the “+” sign represents an addition and a node with the “−” sign represents a subtraction. An arrow represents the data flow. If an input coefficient or a coefficient on an arrow is a power of two, a shift operation is used. Note that the FOSS architecture for a 1D inverse integer transform processes 8 input pixels and outputs 8 transformed values in parallel.
2.2. The FOSS Architecture for a 2D 4 × 4 Inverse Integer Transform
The 2D 4 × 4 inverse integer transform is taken as a second example to demonstrate the proposed FOSS technique more clearly. The H.264 inverse integer transform is defined as

$$Y = C_i^{T} X C_i, \tag{9}$$

where

$$C_i = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1/2 & -1/2 & -1 \\ 1 & -1 & -1 & 1 \\ 1/2 & -1 & 1 & -1/2 \end{bmatrix}. \tag{10}$$

Although 2D transforms are separable, the column-row decomposition method is not used in this study. Instead, a direct 2D transform method is used to eliminate the use of a transpose register array, in order to reduce the latency. Firstly, $C_i$ in (9) is replaced by (10), to produce a single transform matrix that maps the input block directly to the output block.
The inputs of the FOSS architecture are the elements of the input block. One set of outputs is produced when stage 1 performs additions, and the remaining outputs are produced when stage 1 performs subtractions.
Similarly, the FOSS architectures for a 2D 4 × 4 Hadamard transform, a 2D 4 × 4 forward transform, a 2D 2 × 2 Hadamard transform, and a 1D 8 × 8 forward transform are designed using the same procedure.
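The difference between the column-row decomposition and the direct 2D method of Section 2.2 can be checked numerically. The sketch below assumes the commonly cited H.264 4 × 4 inverse core matrix for `CI` and uses hypothetical helper names; it builds one 16 × 16 matrix that maps the 16 inputs straight to the 16 outputs and confirms that it agrees with two 1D passes.

```python
from fractions import Fraction

H = Fraction(1, 2)
# Assumed 4x4 inverse kernel (the commonly cited H.264 inverse core matrix).
CI = [
    [1,  1,  1,  1],
    [1,  H, -H, -1],
    [1, -1, -1,  1],
    [H, -1,  1, -H],
]


def matmul(a, b):
    """Exact product of two small matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def row_column(x):
    """Y = CI^T * X * CI as two 1D passes (needs a transpose buffer in hardware)."""
    return matmul(transpose(CI), matmul(x, CI))


def direct_matrix():
    """16x16 matrix K with vec(Y) = K * vec(X), using row-major vectorisation."""
    a, b = transpose(CI), CI
    return [[a[i][k] * b[l][j] for k in range(4) for l in range(4)]
            for i in range(4) for j in range(4)]


def direct_2d(x):
    """Single-pass 2D transform: one matrix-vector product, no transpose step."""
    vec = [v for row in x for v in row]
    out = [sum(c * v for c, v in zip(k_row, vec)) for k_row in direct_matrix()]
    return [out[r * 4:(r + 1) * 4] for r in range(4)]
```

The FOSS steps are then applied to the rows of the single direct matrix instead of to two separate 1D kernels.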
3. Architecture-Unification Technique
When all six FOSS architectures of H.264 transforms have been obtained, the proposed architecture-unification procedure described in the following is then used to construct a shared architecture for the FOSS architectures.
3.1. Count Decisions of the Unified Architecture Stages and the Shared Nodes
The unified architecture must have the largest stage count of all of the FOSS architectures, because every stage in each FOSS architecture must correspond to a stage in the unified architecture (e.g., Figure 4). Figures 4(a) and 4(b) show FOSS architectures with four and three stages, respectively. The stage count for a unified architecture must be the largest stage count of the two FOSS architectures, if the two FOSS architectures are to be unified. Figure 4(a) has four stages and Figure 4(b) has three stages, so the unified architecture must have four stages, as shown in Figure 4(c).
The stage counts for all FOSS architectures are not identical. In order to unify FOSS architectures with different stage counts, every FOSS architecture shares operation nodes from stage one of the unified architecture. For example, Figure 4(a) shows a four-stage FOSS architecture, where the shared stages are from stage one to stage four and Figure 4(b) is a three-stage FOSS architecture, where the shared stages are from stage one to stage three, as shown in Figure 4(c).
In the unified architecture, the count for the shared nodes in stage $i$ must be the maximum count for the operation nodes of stage $i$ among all of the FOSS architectures, as shown in Figure 5. In Figure 5(a), stage one has four nodes and stage two has three nodes, and in Figure 5(b), stage one has three nodes and stage two has four nodes. Therefore, both stage one and stage two have four nodes in the unified architecture, as shown in Figure 5(c).
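The two count decisions of Section 3.1 reduce to taking maxima. The sketch below is a minimal illustration, assuming each FOSS architecture is described only by its list of node counts per stage (stage one first); the function name `unified_shape` is hypothetical.

```python
def unified_shape(architectures):
    """Return the node count per stage of the unified architecture.

    architectures: list of lists; each inner list holds the node count of
    every stage of one FOSS architecture, aligned from stage one.
    """
    stage_count = max(len(arch) for arch in architectures)
    return [
        max(arch[i] for arch in architectures if i < len(arch))
        for i in range(stage_count)
    ]
```

For the Figure 5 example, `unified_shape([[4, 3], [3, 4]])` gives four nodes in both stages, and architectures with fewer stages simply do not contribute to the later stages.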
3.2. Node Sharing in the Unified Architecture for All of the FOSS Architectures
If $N$ transforms are to be unified, one of the two inputs of a node in the unified architecture can have up to $N$ different input paths. In order to reduce the multiplexer overhead for input path switching, the number of input paths must be minimized. To achieve this goal, the nodes of the unified architecture are shared stage by stage, and every node of a stage is compared for all architectures, using the following procedure.
Step 1. If the nodes have the same two input paths and the same operation, go to Step 8.
Step 2. If the nodes have the same two input paths but different operations, go to Step 8.
Step 3. If the nodes have only one input path in common and the same operation, go to Step 8.
Step 4. If the nodes have only one input path in common but different operations, go to Step 8.
Step 5. If the nodes have one or two input paths in common but in opposite positions and their operations are both addition, go to Step 8.
Step 6. If the nodes have no input path in common but have the same operation, go to Step 8.
Step 7. If the nodes have two different input paths and different operations, go to Step 8.
Step 8. The nodes share a node of the unified architecture.
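One possible reading of this procedure is that the conditions are tested in order, so a lower step number means a better match with fewer extra input paths. The sketch below encodes that reading; the `Node` type, the `"add"`/`"sub"` labels, and both function names are assumptions made for illustration and do not come from the paper.

```python
from typing import NamedTuple


class Node(NamedTuple):
    left: str   # name of the path feeding the first input
    right: str  # name of the path feeding the second input
    op: str     # "add" or "sub"


def sharing_rule(a, b):
    """Return which of Steps 1-7 applies to a pair of nodes (assumed ordering)."""
    same_op = a.op == b.op
    same_position = (a.left == b.left) + (a.right == b.right)
    swapped = (a.left == b.right) + (a.right == b.left)
    if same_position == 2:
        return 1 if same_op else 2
    if same_position == 1:
        return 3 if same_op else 4
    # Swapping the inputs is only harmless for addition, which is commutative.
    if swapped >= 1 and a.op == b.op == "add":
        return 5
    return 6 if same_op else 7


def best_match(node, shared_nodes):
    """Index of the candidate shared node with the lowest rule number."""
    return min(range(len(shared_nodes)),
               key=lambda i: sharing_rule(node, shared_nodes[i]))
```

Under this reading, a node is paired with the candidate that needs the fewest additional multiplexer inputs, which is the stated goal of the procedure.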
Figure 6 shows an example of how nodes are shared for 4 FOSS architectures. Figure 6(a) shows the input paths and the operation nodes for one stage of the 4 FOSS architectures. The nodes of one transform in Figure 6(a) are labeled from top to bottom, as shown in Figure 6(b). The nodes of the 4 FOSS architectures then share nodes using this procedure, and nodes that share a node carry the same label in Figure 6(b), which is also the number of the shared node in the unified architecture, as shown in Figure 6(c). Note that if any one of the inputs of a node in Figure 6(c) has multiple input paths for different transform modes, a multiplexer is required to select the corresponding input path.
3.3. Operation and the Bit-Width Decisions for a Shared Node
In order to perform the operations of the correspondent nodes for all FOSS architectures, it is necessary to determine the operation for each shared node in the unified architecture.
If the operations for the nodes of the FOSS architectures that share a node are all additions or all subtractions, the operation for this shared node of the unified architecture is addition or subtraction, respectively. If some of these operations are additions and some are subtractions, the shared node must support both subtraction and addition. As shown in Figure 7, when the nodes that map to one shared node are adders in all of the FOSS architectures, the shared node is an adder in the unified architecture. Similarly, when they are all subtractors, the shared node is a subtractor. Because the inputs for the corresponding operation nodes in different FOSS architectures are seldom shifted by the same number of bits, it is not efficient to share a shift operation in the unified architecture.
In order to determine bit-width for a shared node of the unified architecture, bit-widths of the nodes that share a node must be determined first. In order to determine the bit-width for a node of a FOSS architecture, both input and output bit-widths for the FOSS architecture must be checked in the video standard specification, as shown in Table 1. The input bit-width for a node in the current stage is then determined according to the output dynamic range that results from addition, subtraction, and the shift of the nodes in the previous stage. Note that the dynamic range accumulates stage by stage from input to output, for a FOSS architecture. For the unified architecture, the bit-width of a shared node is determined by the largest bit-width of all of the nodes that share a node. As shown in Figure 8 the largest bit-widths of input and output of the nodes in all of the FOSS architectures are both 16 bits, so the input and output bit-widths of the node in the unified architecture are both 16 bits.
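The operation and bit-width decisions of Section 3.3 can be summarised in a few lines. The sketch below is an illustration only: the worst-case growth rules (one extra bit per addition or subtraction, one extra bit per left-shift position) are a generic assumption about dynamic range, whereas the paper takes the actual input and output widths from the standard (Table 1). All function names are hypothetical.

```python
def shared_operation(ops):
    """Operation of a shared node, given the ops of the nodes that share it."""
    kinds = set(ops)
    if not kinds:
        raise ValueError("at least one node must map to the shared node")
    if kinds == {"add"}:
        return "add"
    if kinds == {"sub"}:
        return "sub"
    return "add/sub"  # the shared node must support both operations


def width_after_add_sub(width_a, width_b):
    """Worst-case signed width of a + b or a - b (assumed growth rule)."""
    return max(width_a, width_b) + 1


def width_after_left_shift(width, shift):
    """Worst-case width after a left shift by `shift` bits (assumed growth rule)."""
    return width + shift


def shared_width(widths):
    """Bit-width of a shared node: the largest width among its member nodes."""
    return max(widths)
```

For example, if the nodes that share a node need 12, 16, and 14 bits in their own FOSS architectures, the shared node is 16 bits wide.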
3.4. The Design of a Low Cost Multiplexer for the Mux-Based Routing Network
To share a node of the unified architecture among the operations of multiple transforms, additional multiplexers are required to route the corresponding input path for each transform mode. If $N$ transforms share a node, each input of the node has a maximum of $N$ different input paths and two additional $N$-to-1 multiplexers are deployed in front of the node. If an input path is $b$ bits wide, the input requires a $b$-bit $N$-to-1 multiplexer. As shown in Figure 9, one input of the shared node is 8 bits wide, so an 8-bit 6-to-1 multiplexer is required to route one of the 6 input paths to the input of the shared node. If a 1-bit 2-to-1 multiplexer counts as one hardware unit, as shown in Figure 10(a), a 1-bit 6-to-1 multiplexer requires 5 hardware units, as shown in Figure 10(b). Because an 8-bit 6-to-1 multiplexer consists of eight 1-bit 6-to-1 multiplexers, it requires 40 (8 × 5) hardware units.
In order to reduce the hardware cost and to alleviate routing congestion in the VLSI physical design of the mux-based routing network for the unified architecture, the number of input paths for a shared node must be minimized. The multiplexer in Figure 9 is redesigned as a low cost and low routing congestion multiplexer, as shown in Figure 11. In Figure 9, only two distinct input paths feed one input of the shared node, so an 8-bit 2-to-1 multiplexer is used to select between them. In addition, a 1-bit 6-to-1 multiplexer is used to select 0 or 1 for the select line of the 8-bit 2-to-1 multiplexer. The 8-bit 2-to-1 multiplexer requires 8 hardware units and the 1-bit 6-to-1 multiplexer requires 5 hardware units, giving a total of 13 (8 + 5) hardware units. The cost of the multiplexer in Figure 11 is therefore 32.5% of that of the multiplexer in Figure 9, a saving of 67.5%.
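The hardware-unit arithmetic of Section 3.4 can be reproduced with a short calculation. In the sketch below, one hardware unit is a 1-bit 2-to-1 multiplexer, as in the text. The generalisation to more than two distinct paths (one 1-bit mode multiplexer per select bit) is an assumption added for illustration; the paper only describes the two-path case. Function names are hypothetical.

```python
def mux_units(bits, inputs):
    """Hardware units in a `bits`-wide `inputs`-to-1 multiplexer tree."""
    return bits * (inputs - 1)


def reduced_mux_units(bits, modes, distinct_paths=2):
    """Units for a data mux over the distinct paths plus mode-to-select muxes.

    Assumption: each select bit of the data mux is driven by its own
    1-bit `modes`-to-1 multiplexer whose inputs are constants.
    """
    select_bits = (distinct_paths - 1).bit_length()  # ceil(log2(distinct_paths))
    return mux_units(bits, distinct_paths) + select_bits * mux_units(1, modes)


full = mux_units(8, 6)             # 8-bit 6-to-1 multiplexer of Figure 9
reduced = reduced_mux_units(8, 6)  # 8-bit 2-to-1 plus 1-bit 6-to-1 of Figure 11
cost_ratio = reduced / full
```

With the figures from the text this gives 40 units for the original multiplexer, 13 units for the redesigned one, and a cost ratio of 0.325.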
The multiplexers required for a shared node result not only in extra hardware cost, but also in routing congestion for the unified architecture. The number of input paths for a 1-bit 6-to-1 multiplexer is 11, as shown in Figure 10(b), and that for an 8-bit 6-to-1 multiplexer is 88, as shown in Figure 9, which could cause clustering of the routing wires in particular areas. The routing congestion can be mitigated using the proposed multiplexer design, wherein the number of input paths is only 33, as shown in Figure 11. In addition, a good floor plan that uniformly distributes the multiplexers around the chip can disperse the routing wires. Note that additional latency is incurred by the multiplexers of the routing network that serialize the operations for different transform modes.
3.5. The Architecture-Unification Technique for All the H.264 Transform FOSS Architectures
The architecture-unification technique consists of count decisions for the unified architecture stages and the shared nodes, node sharing in the unified architecture for all FOSS architectures, the operation and the bit-width decisions for a shared node, and the design of a low cost multiplexer for the mux-based routing network. When this process is applied to all of the H.264 transform architectures designed using the FOSS technique, the result is the unified architecture shown in Figure 12, which covers the 2D 4 × 4 and 2 × 2 Hadamard transform, 2D 4 × 4 forward transform, 2D 4 × 4 inverse transform, 1D 8 × 8 forward transform, and 1D 8 × 8 inverse transform FOSS architectures.
4. Complexity and Performance Analysis
4.1. The Computational Complexity of the Original Transforms, the FOSS Architectures, and the Unified Architecture
The computational complexities of the original transforms, the FOSS architectures, and the unified architecture are analyzed. The computational complexities of the original transforms and the FOSS architectures are compared in Tables 2 and 3. As shown in Table 2, a total of 844 adders and 912 multipliers are needed for all of the six original H.264 transforms. However, only 24 sub/adders, 96 adders, 84 subtractors, and 60 shifters are needed for all of the six FOSS architectures that reduce the cost by sharing the same operations and by replacing the multipliers by adders and shifters, as shown in Table 3. The computational complexities of the FOSS architectures and the unified architecture are compared in Tables 3 and 4. Table 4 shows the computational complexity of the unified architecture, which only requires 28 adders, 16 subtractors, and 40 shifters and needs no sub/adders.
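The totals quoted for the original transforms can be reproduced under one assumption, stated here because Table 2 is not reproduced in this text: each 1D 8 × 8 transform is counted as an 8 × 8 matrix times a length-8 vector, and each direct 2D transform on an n × n block is counted as a dense n² × n² matrix times a length-n² vector. The dictionary and function names below are hypothetical.

```python
def dense_matvec_ops(n):
    """(multiplications, additions) for an n x n matrix times a length-n vector."""
    return n * n, n * (n - 1)


# Size of the dense matrix-vector product assumed for each original transform.
TRANSFORM_SIZES = {
    "1D 8x8 forward": 8,
    "1D 8x8 inverse": 8,
    "2D 4x4 forward": 16,
    "2D 4x4 inverse": 16,
    "2D 4x4 Hadamard": 16,
    "2D 2x2 Hadamard": 4,
}

total_multipliers = sum(dense_matvec_ops(n)[0] for n in TRANSFORM_SIZES.values())
total_adders = sum(dense_matvec_ops(n)[1] for n in TRANSFORM_SIZES.values())
```

Under this counting the totals come to 912 multipliers and 844 adders, which agrees with the figures quoted from Table 2, and the 8 × 8 case alone gives the 64 multiplications and 56 additions mentioned in Section 2.1.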
4.2. The Hardware Cost and Performance of the FOSS Architectures and the Unified Architecture
In order to implement the unified architecture, a front-end cell-based design flow is employed for logic design, simulation, and verification of VLSI implementation. The proposed FOSS architectures and the unified architecture are firstly realized using Verilog RTL code, and then a ModelSim EDA tool is used for the RTL functional simulation. Synopsys Design Compiler is used for logic synthesis and the standard cell library used is the UMC 0.18 μm Artisan cell library. The hardware cost and the performance of the FOSS architectures and the unified architecture are analyzed. The gate count is calculated using
Because the matrix multiplications of the original transforms can be implemented with different hardware architectures, the gate counts for the original transforms are not known. Table 5 shows the gate counts for the FOSS architectures and Table 6 shows that for the unified architecture, whose gate count is 36% less than the total gate count of the six FOSS architectures. Because the six FOSS architectures share the unified architecture, the number of gates saved for an individual transform is not known; the total number of gates saved is therefore the only measure of the reduction in hardware cost that results from the unification technique.
Since there is no register in either the FOSS architectures or the unified architecture, the critical path is the longest of all the paths from the input to the output of the architecture, and the maximum frequency is the reciprocal of the critical path delay. Because of the delay due to the mux-based routing network, the critical path for the unified architecture is slightly longer than that for any FOSS architecture. In other words, the frequency of the unified architecture is lower than that of any FOSS architecture. The frequency range of the FOSS architectures is from 285 MHz to 385 MHz, while the unified architecture still operates at up to 275 MHz, as shown in Tables 5 and 6. For the same reason, the latencies are all one clock cycle. The throughput for the 2D 2 × 2 Hadamard transform FOSS architecture is 4 pixels/cycle, but those of the other transform FOSS architectures are all 8 pixels/cycle.
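The relation between clock frequency, throughput, and frame rate can be checked with a short calculation. The sketch below assumes that a Full-HD frame is 1920 × 1080 pixels in 4:2:0 format, that is, 1.5 samples per pixel; this assumption is not stated in the text but reproduces the quoted frame rate. The function names are hypothetical.

```python
def critical_path_ns(frequency_hz):
    """Critical path delay in nanoseconds for a given maximum clock frequency."""
    return 1e9 / frequency_hz


def frames_per_second(pixels_per_cycle, frequency_hz,
                      width=1920, height=1080, samples_per_pixel=1.5):
    """Frame rate when every sample of every frame passes through the transform.

    samples_per_pixel=1.5 models 4:2:0 chroma subsampling (an assumption).
    """
    samples_per_frame = width * height * samples_per_pixel
    return pixels_per_cycle * frequency_hz / samples_per_frame


unified_fps = frames_per_second(8, 275e6)
unified_path = critical_path_ns(275e6)
```

At 8 pixels/cycle and 275 MHz this gives about 707 frames/second, with a critical path of roughly 3.6 ns.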
5. Conclusion
In this paper, a systematic hardware sharing method that allows a unified architecture for H.264 transforms is presented. A FOSS architecture design technique is presented to reduce the hardware cost for each H.264 transform. The basic idea is to systematically synthesize an architecture that shares all of the same operations in a matrix multiplication, which reduces the hardware cost. In total, 844 adders and 912 multipliers are required for the six H.264 transform matrix multiplications. However, only 24 sub/adders, 96 adders, 84 subtractors, and 60 shifters are required for the six FOSS architectures, which reduce the cost by sharing the same operations and by replacing all of the multipliers by adders and shifters. Once the six FOSS architectures of the H.264 transforms have been determined, the proposed architecture-unification design flow unifies them into a single architecture to eliminate the redundant hardware. The unified architecture only requires 28 adders, 16 subtractors, 40 shifters, and the proposed mux-based routing network. The gate count for the unified architecture is 16308, which is 36% less than the total gate count for the six FOSS architectures. The frequency range of the FOSS architectures is from 285 MHz to 385 MHz, while the unified architecture still operates at up to 275 MHz. Since there is no register in either the FOSS architectures or the unified architecture, the latencies are all one clock cycle. The throughput for the 2D 2 × 2 Hadamard transform is 4 pixels/cycle, but those of the other transforms are all 8 pixels/cycle. In addition, the proposed hardware sharing method can also be used to construct low cost unified architectures for the multitransform coding of other international video coding standards, such as VC-1, MPEG-1/2/4, and even the next generation high efficiency video coding (HEVC).
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
“Coding of still pictures,” ISO/IEC JTC 1/SC 29/WG 1, 2009.
Video Coding Standard, “Information technology - coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s - part 2: video,” ISO/IEC 11172-2 MPEG-1, 1993.
Video Coding Standard, “Information technology - generic coding of moving pictures and associated audio information: video,” ISO/IEC 13818-2 MPEG-2, 1995.
Video Coding Standard, “Information technology - coding of audio-visual objects - part 2: visual,” ISO/IEC 14496-2 MPEG-4, 2004.
S. Gordon, D. Marpe, and T. Wiegand, “Simplified use of 8 × 8 transforms - updated proposal and results,” in Proceedings of the 11th Meeting, JVT-K028, Munich, Germany, March 2004.
T. Wang, Y. Huang, H. Fang, and L. Chen, “Parallel 4 × 4 2D transform and inverse transform architecture for MPEG-4 AVC/H.264,” in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 800–803, May 2003.
Z.-Y. Cheng, C.-H. Chen, B.-D. Liu, and J.-F. Yang, “High throughput 2-D transform architectures for H.264 advanced video coders,” in Proceedings of the IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS '04), pp. 1141–1144, December 2004.
K.-H. Chen, J.-I. Guo, K.-C. Chao, J.-S. Wang, and Y.-S. Chu, “A high-performance low power direct 2-D transform coding IP design for MPEG-4 AVC/H.264 with a switching power suppression technique,” in Proceedings of the IEEE VLSI-TSA International Symposium on VLSI Design, Automation and Test (VLSI-TSA-DAT '05), pp. 291–294, April 2005.
H.-Y. Lin, Y.-C. Chao, C.-H. Chen, B.-D. Liu, and J.-F. Yang, “Combined 2-D transform and quantization architectures for H.264 video coders,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), pp. 1802–1805, Kobe, Japan, May 2005.
Y. K. Lai and Y. F. Lai, “A reconfigurable IDCT processor architecture for video coding applications,” in Proceedings of the International Conference on Manufacturing and Engineering Systems (MES '09), pp. 469–472, December 2009.
Y.-C. Chao, H.-H. Tsai, Y.-H. Lin, J.-F. Yang, and B.-D. Liu, “A novel design for computation of all transforms in H.264/AVC decoders,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '07), pp. 1914–1917, July 2007.