#### Abstract

This work presents a flexible VLSI architecture to compute the $N$-point DCT. Since HEVC supports different block sizes for the computation of the DCT, that is, $4 \times 4$ up to $32 \times 32$, the design of a flexible architecture to support them helps reduce the area overhead of hardware implementations. The hardware proposed in this work is partially folded to save area while retaining enough speed for large video sequence sizes. The proposed architecture relies on the decomposition of the DCT matrices into sparse submatrices in order to reduce the number of multiplications. Finally, multiplications are completely eliminated using the lifting scheme. Running at 150 MHz, the proposed architecture sustains real-time processing of 1080P HD video.

#### 1. Introduction

As technology evolves, hardware keeps shrinking while storage capacity increases. High-end video applications have become a demanding part of daily life, for example, watching movies, video conferencing, and creating and saving videos using high-definition video cameras. A single device, such as a new high-end mobile phone or smartphone, can now support all the multimedia applications, which seemed a dream before. As a consequence, new highly efficient video coders are of paramount importance. However, high efficiency comes at the expense of computational complexity. As pointed out in [1, 2], several blocks of video codecs, including the transform stage [3], motion estimation, and entropy coding [4], are responsible for this high complexity. As an example, the discrete cosine transform (DCT), which is used in several standards for image and video compression, is a computation-intensive operation. In particular, its direct implementation requires a large number of additions and multiplications.

HEVC, the new and yet-to-be-released video coding standard, addresses high-efficiency video coding. One of the tools employed to improve coding efficiency is the DCT with different transform sizes. As an example, the 16-point DCT of HEVC is shown in [5]. In video compression, the DCT is widely used because it compacts the image energy at the low frequencies, making it easy to discard the high-frequency components. To meet the requirement of real-time processing, hardware implementations of the 2-D DCT/inverse DCT (IDCT) are adopted, for example, [6]. The 2-D DCT/IDCT can be implemented with the 1-D DCT/IDCT and a transpose memory in a row-column decomposition manner. In the direct implementation of the DCT, floating-point multiplications have to be tackled, which cause precision problems in hardware. Hence, we propose a Walsh-Hadamard transform-based DCT implementation [7]. Then, inspired by the DCT factorizations proposed in [8, 9], we factorize the remaining rotations into simpler steps through the lifting scheme [10]. The resulting lifting scheme-based architecture, inspired by [11–13], is simplified by exploiting the techniques proposed in [9, 14] to achieve a multiplierless implementation. Other techniques can be employed to achieve multiplierless solutions, such as the ones proposed in [8, 15–18], but they are not discussed in this work. The proposed multisize DCT architecture supports all the block sizes of HEVC and targets the real-time processing of 1080P HD video sequences.

The rest of this paper is organized as follows. Section 2 reviews the 2-D DCT. Section 3 shows the matrix decompositions for different DCT sizes. Section 4 presents the proposed hardware architecture. The VLSI implementation and the simulation results are given in Section 5. Finally, Section 6 concludes this paper.

#### 2. Review of 2D Transform

According to [19], the DCT in the so-called -form for an $N$-point block of samples is obtained as where and . The same result expressed in (1) can be rewritten as the product of a matrix for as follows: where is the transposition operator. The DCT matrix can be expressed in terms of a reduced number of angles by exploiting the symmetry properties of the trigonometric functions. For 16 points, the DCT matrix can be represented as where is shown in (20), with where and is an odd integer such that .
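As a point of reference for the matrix form described above, the following minimal sketch builds the orthonormal DCT-II matrix and applies it to a block of samples. The orthonormal scaling is an assumption (it is the convention used by MATLAB's `dctmtx`); the normalization of the paper's own equations may differ. The names `dct_matrix` and `dct_1d` are illustrative only.

```python
import math


def dct_matrix(n):
    """Return the n x n orthonormal DCT-II matrix as a list of rows.

    Assumption: orthonormal scaling; the paper's normalization may differ.
    """
    rows = []
    for k in range(n):
        # Row 0 uses sqrt(1/n); all other rows use sqrt(2/n).
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def dct_1d(x):
    """Transform a block of samples with the matrix-vector product X = C x."""
    c = dct_matrix(len(x))
    return [sum(c[k][i] * x[i] for i in range(len(x))) for k in range(len(x))]


if __name__ == "__main__":
    # A constant block has all its energy in the first (DC) coefficient.
    print(dct_1d([1.0, 1.0, 1.0, 1.0]))
```

With this scaling the matrix is orthogonal, so the inverse transform is simply the transpose.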

In [20], it is shown that every even-odd transform can be represented in terms of any other even-odd transform through a conversion matrix. In particular, in [7] it is shown that the DCT can be expressed in terms of the Walsh-Hadamard transform (WHT) [21], and the conversion matrix has a block diagonal structure. In [22], it is shown that the WHT can be implemented with a fast algorithm based on a butterfly structure. Among the possible WHT matrix representations, the Walsh-ordered one is applied by deriving it from the corresponding Hadamard-ordered matrix. The -order Hadamard matrix can be expressed as where . The corresponding Walsh matrix is obtained by applying a two-step procedure [23] as follows: (1) bit-reverse the order of the rows of , (2) apply Gray coding to the row indices. Therefore, (4) can be written as where is the -point bit reversal matrix, and is a block diagonal matrix with a recursive structure, is the identity matrix and It is worth noting that can be factorized in terms of Givens rotations, where the Givens rotation matrix for a rotation angle is In particular can be decomposed into the product of permutation matrices and Givens rotation matrices with and as The permutation matrix is obtained by applying the following permutation: to the rows or to the columns of , the identity matrix. It is worth observing that can be defined recursively as where and .
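The two-step Hadamard-to-Walsh reordering can be sketched as follows. This is a minimal illustration that assumes the Hadamard-ordered matrix is the Sylvester construction and that "Gray coding" is the usual binary-to-Gray mapping; the function names are hypothetical. Row $k$ of the resulting Walsh matrix then has exactly $k$ sign changes (sequency order).

```python
def hadamard(n):
    """Sylvester (Hadamard-ordered) +1/-1 matrix of order n, n a power of two."""
    h = [[1]]
    while len(h) < n:
        # H_2m = [[H_m, H_m], [H_m, -H_m]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def gray(i):
    """Binary-to-Gray conversion."""
    return i ^ (i >> 1)


def walsh(n):
    """Walsh-ordered matrix: bit-reverse the rows, then Gray-code the indices."""
    bits = n.bit_length() - 1
    h = hadamard(n)
    return [h[bit_reverse(gray(k), bits)] for k in range(n)]


def sign_changes(row):
    """Number of sign changes along a row (its sequency)."""
    return sum(1 for a, b in zip(row, row[1:]) if a != b)


if __name__ == "__main__":
    for row in walsh(8):
        print(sign_changes(row), row)
```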

The Givens rotation matrices can be described as follows: contains Givens rotations disposed in concentric squares with the rotation angle increasing from the outer square to the inner one. The definition of each is given in (5). In the outermost square , whereas the value of increases from the outer square to the inner one. Since is the multiplied value in the numerator of (5), the angle of the Givens rotations increases from outer to inner squares. The general structure of is shown in (15). In the next sections, the expression shown in (15) will be detailed for different values of $N$. The remaining matrices , for , are where and . Finally, each Givens rotation can be factorized into lifting steps by means of the lifting scheme [10] as suggested in [9] as follows:
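The factorization of a Givens rotation into three lifting steps can be checked numerically with a short sketch. The rotation sign convention and the coefficient names `p` and `u` below are assumptions made for illustration (the standard factorization uses $p = (\cos\theta - 1)/\sin\theta$ and $u = \sin\theta$); the paper's equation may use a different but equivalent convention.

```python
import math


def givens(theta, x0, x1):
    """Direct 2x2 rotation [[c, -s], [s, c]] applied to (x0, x1)."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x0 - s * x1, s * x0 + c * x1


def givens_lifting(theta, x0, x1):
    """Same rotation as three lifting steps (requires sin(theta) != 0)."""
    p = (math.cos(theta) - 1.0) / math.sin(theta)  # equals -tan(theta / 2)
    u = math.sin(theta)
    x0 = x0 + p * x1   # first lifting step
    x1 = x1 + u * x0   # second lifting step
    x0 = x0 + p * x1   # third lifting step
    return x0, x1


if __name__ == "__main__":
    print(givens(math.pi / 8, 3.0, -2.0))
    print(givens_lifting(math.pi / 8, 3.0, -2.0))
```

Each step modifies only one of the two values, which is what makes the structure attractive for hardware: it needs three multiplications by constants instead of four and can later be quantized step by step.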

#### 3. Matrix Decompositions for Different $N$

Matrices for different DCT sizes are derived using the factorization presented in Section 2. In the following paragraphs, factorizations for $N$ ranging from 4 to 32 are explicitly shown.

##### 3.1. 4-Point DCT

Equation (6) can be given as Equation (7) can be given as where where is shown in (9) and is a identity matrix. is a Walsh-Hadamard matrix which is calculated using (17).

##### 3.2. 8-Point DCT

Equation (6) can be given as Equation (7) can be given as where where is shown in (9) and is a identity matrix. is a Walsh-Hadamard matrix which is calculated using (21). According to (11), and . So can be written as where, according to (14) and, according to (15), and , so Using (12) and (13) the permutation is computed as Finally the permutation matrix is obtained by applying the permutation, shown in (27), to the columns of identity matrix as,

##### 3.3. 16-Point DCT

Equation (6) can be given as Equation (7) can be given as where where and are shown in (9) and (24). To calculate , according to (11), and . So can be written as where, according to (14), Similarly, to calculate , according to (15), and , so where is calculated in (25). For , and , so Using (12) and (13) the permutation is computed as Finally the permutation matrix is obtained by applying the permutation, shown in (37), to the columns of identity matrix as

##### 3.4. 32-Point DCT

Equation (6) can be given as Equation (7) can be given as where where , and are shown in (9), (24), and (32). To calculate , according to (11), and . So can be written as where, according to (14), Similarly, to calculate , according to (15), and , so where is calculated in (33). For , and , so where is calculated in (25). For , and , so Using (12) and (13) the permutation is computed as Finally, the permutation matrix is obtained by applying the permutation, shown in (47), to the columns of identity matrix as

#### 4. Proposed Architecture

The complete hardware architecture of the DCT is shown in Figure 1. Each frame is loaded in the input frame memory, and the complete frame is divided into $N \times N$ blocks. The control unit reads the rows of each block from the input memory. At the same time, the control unit passes a “1” to the input multiplexers, and the address and other control signals are passed to the DCT block. Once the DCT of a row has been calculated, the transformed row is written to the transpose memory at its corresponding address. In this way, for the first $N$ clock cycles, the rows from the input memory are input to the DCT and are written to the corresponding addresses in the transpose memory. After these $N$ clock cycles, the control unit passes a “0” to the input multiplexers for the next $N$ clock cycles. In this way, each column from the transpose memory is input to the DCT block, and the outputs of the DCT block are written back to the transpose memory, at the same location from which they were read. When all the columns have been read and processed by the DCT, the control unit starts reading the next block from the input memory and, at the same time, each row from the transpose memory is written to the output transformed memory. In this way, all the blocks are read, processed, and written to the output transformed memory.

When the last row is processed through the DCT, it is written to the transpose memory. At the same time, the first column from the transpose memory is read in order to be processed through the DCT block. Since the last row has not yet been written, the last data of the first column is not valid, so the “Data0” multiplexer is used for forwarding. In this way, the first output of the last transformed row of a block is forwarded to the input of the DCT and is also written to the transpose memory.

##### 4.1. DCT Block

The DCT block is the main block of the complete architecture. It takes the input data, the corresponding control signals, and the corresponding addresses. The internal architecture of the DCT block is shown in Figure 2.

The DCT block has 4 pipeline stages. The data is first passed through the Hadamard block, which is designed with a fully parallel architecture. The Hadamard block takes 32 data values at its inputs and passes them to the butterfly_32, while the first 16 are also input to the butterfly_16, the first 8 to the butterfly_8, and the first 4 to the butterfly_4. Multiplexers are placed at the inputs of each butterfly of a different size in order to obtain the correct result from the Hadamard block. The select signals for the multiplexers are controlled by the control unit. The Hadamard block has 32 outputs. To obtain a Walsh transform from a Hadamard one, the bit_reversal and gray_code blocks are placed after the Hadamard block.
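The butterfly structure of the Hadamard block can be modeled in software with the usual fast WHT recursion. The sketch below is a behavioral illustration only, assuming Hadamard (natural) output order; it does not model the input multiplexers or the sharing between the butterfly_32, butterfly_16, butterfly_8, and butterfly_4 units, and the name `fwht` is hypothetical.

```python
def fwht(x):
    """Fast Walsh-Hadamard transform in Hadamard order; len(x) is a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        # One stage of n/2 two-input/two-output butterflies.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    return y


if __name__ == "__main__":
    print(fwht([1, 0, 0, 0]))
    print(fwht([3, 1, -2, 5, 0, 4, -1, 2]))
```

Every butterfly uses only one addition and one subtraction, so this stage needs no multipliers.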

In the bit_reversal block, the data at input port number $i$ is moved to output port number $j$, where $j$ is determined by writing $i$ in binary representation and reversing the order of the bits. For example, in case of DCT_16, the Hadamard block produces 16 valid outputs, so the 16 inputs to the bit_reversal block are shuffled according to the bit reversal rule over 4 bits. For example, $i = 0$ means $0000$, and its bit reversal is also $0000$, so the first input port is connected to the first output port. Similarly, for $i = 1 = 0001$, the bit reversal is $1000 = 8$, which means that the second input port is connected to output port number 8. In this way, all the inputs are connected to the outputs according to the bit reversal rule. As the architecture supports four different sizes of DCT, the bit reversal rule is different for each DCT size. For example, for DCT_4, $i = 1 = 01$ is connected to $10 = 2$, that is, output port number 2, while in case of DCT_16 it is connected to output port number 8. Multiplexers are therefore placed in the bit reversal block in order to support all the DCT sizes.
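The size-dependent wiring of the bit reversal block can be tabulated with a few lines of code. This is a sketch with zero-based port numbers; `bit_reverse` and `bit_reversal_ports` are illustrative names, not blocks of the actual design.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of a port index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversal_ports(n):
    """Output port for every input port of an n-point transform (n = 4, 8, 16, 32)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, bit_reversal_ports(n))
```

Because input port 1 maps to a different output for each size (2, 4, 8, and 16 respectively), a fixed wiring cannot serve all sizes, which is why the hardware needs multiplexers at this stage.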

The gray code block works on the same principle as the bit reversal block, but according to the Gray code rule: the output port is determined by applying the Gray code to the input port number. For example, for DCT_32, if $i = 13$ ($01101$), the Gray code is $01011$, that is, $j = 11$, so input port number 13 is connected to output port number 11. The Gray code calculation does not depend on the DCT size. For example, for DCT_16, if $i = 13$ ($1101$), the Gray code is $1011$, that is, $j = 11$, which means that input port 13 is connected to output port number 11, the same as for DCT_32.
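The Gray code routing can be reproduced with the standard binary-to-Gray conversion, as in the sketch below (zero-based port numbers; the function names are illustrative). It also shows why no size-dependent multiplexing is needed here: the mapping for a smaller size is a prefix of the mapping for a larger one.

```python
def gray_code(index):
    """Binary-to-Gray conversion: XOR the index with itself shifted right by one."""
    return index ^ (index >> 1)


def gray_ports(n):
    """Output port for every input port of an n-point transform."""
    return [gray_code(i) for i in range(n)]


if __name__ == "__main__":
    print(gray_code(13))
    print(gray_ports(16))
```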

The architecture of the memory block (mem.block) is shown in Figure 3. The memory block connects the first 16 inputs directly to the output ports, while the last 16 outputs are multiplexed between the latched inputs and the direct inputs. The last 16 outputs are used in case of DCT_32, while the last 16 inputs are bypassed in case of DCT_4, DCT_8, and DCT_16.

The permutation block is implemented using (27), (36), and (46). The block takes 32 inputs and sends 16 of them to the outputs at a time, according to the permutation rule. The first 16 inputs to the permutation block are passed to the outputs in the first clock cycle, while the last 16 inputs are passed to the outputs in the second clock cycle. In case of DCT_4, DCT_8, and DCT_16, the selection line of the multiplexers is always set to “0”, while in case of DCT_32, the selection line is “0” in the first clock cycle and “1” in the next clock cycle.

The lifting scheme is implemented using (19), (23), (31), and (40). The lifting block is implemented with a folded architecture: the block operates in a fully parallel way for DCT sizes of 4, 8, and 16, while for DCT_32 the block is reused. Each row of DCT_32 takes 2 clock cycles to complete. During the first clock cycle, the upper 16 inputs are processed by the lifting scheme and are stored in the memory block. In the next clock cycle, the lower 16 inputs are processed by the lifting block, and the result, along with the previously stored values, is forwarded to the next block. The lifting block is shown in Figure 4.

The lifting scheme is designed for 15 Givens rotations. The basic lifting structure, shown in Figure 5, is implemented using (16), where as suggested in [9].

The lifting structure takes two values at its inputs. For each Givens rotation, the , , , and are approximated integer values, chosen so that the approximated DCT matches the actual DCT as closely as possible. As and are integers, the multiplications are implemented using adders and shift operations. The result of each lifting structure is quantized to 16-bit resolution to obtain a reasonable PSNR value, so the final outputs of the lifting block are 16 bits wide. The results of some lifting structures are bypassed using the multiplexers at their outputs. The select line for these multiplexers always remains “0” for DCT_4, DCT_8, and DCT_16, while for DCT_32 the select line is “0” for the first clock cycle and “1” for the next clock cycle.
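The idea of replacing the lifting multiplications with integer coefficients and shifts can be sketched as follows. Everything numeric here is a hypothetical choice for illustration: the 8 fractional bits (`FRAC_BITS`), the rounding of the coefficients, and the use of a plain integer product, which a hardware implementation would expand into shifts and additions. The paper's actual coefficients are those of its Table 1, which are not reproduced here.

```python
import math

FRAC_BITS = 8  # hypothetical coefficient precision, not taken from the paper


def quantize(coeff, frac_bits=FRAC_BITS):
    """Approximate a real coefficient by an integer numerator over 2**frac_bits."""
    return round(coeff * (1 << frac_bits))


def lift_int(theta, x0, x1, frac_bits=FRAC_BITS):
    """Integer three-step lifting approximation of a Givens rotation."""
    p = quantize((math.cos(theta) - 1.0) / math.sin(theta), frac_bits)
    u = quantize(math.sin(theta), frac_bits)
    # Each product by an integer constant stands in for a shift-and-add network;
    # the right shift performs the normalization by 2**frac_bits.
    x0 += (p * x1) >> frac_bits
    x1 += (u * x0) >> frac_bits
    x0 += (p * x1) >> frac_bits
    return x0, x1


def unlift_int(theta, y0, y1, frac_bits=FRAC_BITS):
    """Exact inverse: undo the three steps in reverse order."""
    p = quantize((math.cos(theta) - 1.0) / math.sin(theta), frac_bits)
    u = quantize(math.sin(theta), frac_bits)
    y0 -= (p * y1) >> frac_bits
    y1 -= (u * y0) >> frac_bits
    y0 -= (p * y1) >> frac_bits
    return y0, y1


if __name__ == "__main__":
    print(lift_int(math.pi / 8, 1000, -700))
```

A useful property of this structure is that the integer version stays exactly invertible regardless of how coarsely the coefficients are approximated, because each step can be undone by subtracting the same quantity.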

In case of DCT_32, the Hadamard block produces 32 results in parallel. The outputs of the Hadamard block are fed into the bit reversal block, the gray code block, and the following bit reversal block, and the output of this bit reversal block is fed into the memory block. The memory block passes the upper 16 inputs directly to the permutation block, while the lower 16 inputs are stored in the registers. The permutation block forwards the upper 16 inputs to the lifting scheme, where the select lines for the multiplexers are set to “0”; the lifting scheme outputs the results, which are stored in the following memory block. In the next clock cycle, the lower 16 values stored in the first memory block are passed to the lifting scheme through the permutation block, and the selection lines for the multiplexers in the lifting scheme are set to “1”. The 16 results are calculated and passed to the memory block. At the same time, the memory block forwards the previously stored values along with the newly arrived ones.

In case of DCT_4, DCT_8, and DCT_16, the lower 16 values from the first memory block are invalid and never used, so only the valid upper 16 inputs are fed into the lifting scheme via the permutation network. The selection line for the lifting scheme multiplexers is always set to “0” in these cases.

DCT_4, DCT_8, and DCT_16 take one clock cycle to calculate one row or one column, while DCT_32 takes 2 clock cycles. The outputs of the second memory block are passed to the third bit reversal block, passing through a fully parallel permutation network. The outputs of the bit reversal are then divided by the square root of $N$, where $N$ is the DCT size. The square roots are calculated as

The hardware architecture of the one square root block is shown in Figure 6. The input is divided by and the calculated values are fed into the output multiplexer, where the valid result is sent to output depending on the DCT size. Finally, the outputs from the square root block are quantized from 16 bits to 13 bits using the block.

##### 4.2. Transpose Buffer

The transpose buffer is designed using registers. The buffer is designed to support the maximum DCT size, that is, DCT_32. So the buffer is of size $N \times N \times W$ bits, where $N = 32$ and $W = 13$, where $W$ is the width of each data value. So a total of 13 kbits of memory is used to implement the transpose buffer. The inputs of the buffer are the clock, reset, transpose signal, row number, column number, read enable signal, and write enable signal. During the direct cycle, all the rows from the input frame memory are transformed through the DCT block, and the results are stored on the corresponding rows in the transpose buffer. When all the rows of an input block have been transformed and written to the transpose buffer, the columns of the transpose buffer are read and transformed via the DCT block, and the results are written back to the transpose buffer in a transposed way, that is, on each column. When all the columns of the transpose buffer have been read, transformed, and written back to the buffer, the rows of the next block are read from the input frame memory and, at the same time, the rows of the transpose buffer are written to the output frame memory. In this way, the complete frame is transformed and written to the output memory.
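The row-column schedule around the transpose buffer can be modeled behaviorally as below. This is a floating-point sketch under stated assumptions: an orthonormal 1-D DCT-II stands in for the proposed multiplierless DCT block, and timing, forwarding, and bit widths are ignored. The names `dct_1d` and `dct_2d_row_column` are illustrative.

```python
import math


def dct_1d(x):
    """Reference orthonormal 1-D DCT-II (stand-in for the hardware DCT block)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                               for i in range(n)))
    return out


def dct_2d_row_column(block):
    """2-D DCT of a square block using a model of the transpose buffer."""
    n = len(block)
    buf = [[0.0] * n for _ in range(n)]  # models the transpose buffer
    # Direct cycles: transform each input row and store it on the same row.
    for r in range(n):
        buf[r] = dct_1d(block[r])
    # Transpose cycles: read a column, transform it, write it back in place.
    for c in range(n):
        col = dct_1d([buf[r][c] for r in range(n)])
        for r in range(n):
            buf[r][c] = col[r]
    return buf  # rows of the buffer are then streamed to the output memory


if __name__ == "__main__":
    for row in dct_2d_row_column([[1.0] * 4 for _ in range(4)]):
        print(row)
```

Writing each transformed column back to the location it was read from means a single $N \times N$ buffer suffices for both passes.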

##### 4.3. Input MUXes Block

The DCT first transforms the rows of the block, and the results are stored in the buffer; during these direct cycles the select signal for the input MUXes block is set to “1”. After $N$ clock cycles, where $N$ is the size of the DCT, the select signal of the input MUXes is set to “0”, so that the columns from the transpose buffer are fed into the DCT block. The input MUXes block thus switches the inputs of the DCT block between the direct and the transpose cycles.

##### 4.4. Data0 Multiplexer

During the direct cycles, the rows from the input memory are transformed and the results are written to the transpose buffer. When the last row from the input memory is transformed, the first column is read from the transpose buffer in the next clock cycle. At this point, the last data of the first column is not valid, as the last transformed row has not yet been written to the memory. So the data0 multiplexer is used for forwarding, where the first of the 32 transformed data values is selected from the data0 memory. The select line of the data0 multiplexer is set to “1” for just one clock cycle during the transformation of a block, that is, when the first column is read from the transpose buffer and the last row is transformed via the DCT block; otherwise the select signal is always set to “0”.

##### 4.5. Control Unit

The control unit controls the activities of all the blocks in each clock cycle and is responsible for the correct sequence of operations. The control unit is designed using 4 memories, where each memory contains the control signals for one DCT size. There are 4 counters in the unit, where each counter produces the addresses for its corresponding memory. In response to the addresses, the memories output the control signals. The outputs of the memories are multiplexed, where the selection line of the multiplexer decides which memory output is forwarded. The hardware architecture of the control unit is shown in Figure 7. MEM_CU_4 is a size, MEM_CU_8 is a size, and MEM_CU_16 is a bits size while MEM_CU_32 is a size. The memories contain control signals for direct cycle and clock cycles for the transpose cycle, where $N$ is the DCT size. But MEM_CU_32 contains control signals, because each row or column takes two clock cycles to complete in case of DCT_32. So the control unit generates 128-bit-wide control signals for all the functional blocks of the complete DCT in each clock cycle.

#### 5. Results

The computation of the $N$-point DCT by means of the WHT factorization requires the following: (1) 2-input/2-output butterflies for the $N$-point WHT; (2) 2-input/2-output lifting-based (3-lifting-step) structures for the Givens rotations. The WHT is implemented with a fully parallel architecture using the maximum number of resources, while the lifting scheme is implemented with a folded architecture. Hence, 80 2-input/2-output butterflies are used to implement the WHT for $N = 32$. The number of adders required to implement the 80 2-input/2-output butterflies is 160. The total number of 2-input/2-output lifting structures needed to implement the 15 Givens rotations is 49, but with the folded architecture the data path is reused and the number of lifting structures is reduced to 32.
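The butterfly and adder counts quoted above for the 32-point WHT can be cross-checked with a small calculation. The sketch assumes the usual radix-2 structure with $\log_2 N$ stages of $N/2$ butterflies and two adders (one addition, one subtraction) per butterfly; this count matches the figures given for $N = 32$ and is not taken from the paper's own expression.

```python
def wht_resources(n):
    """Butterflies and adders of a fully parallel radix-2 n-point WHT.

    Assumption: log2(n) stages of n/2 butterflies, two adders per butterfly.
    """
    stages = n.bit_length() - 1
    butterflies = (n // 2) * stages
    adders = 2 * butterflies
    return butterflies, adders


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, wht_resources(n))
```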

The factorization of the matrices is applied to H.265 DCT. The lifting coefficients are approximated with the following condition: where is the -point DCT obtained from MATLAB function dctmtx, scaled with . Table 1 shows the approximated values, calculated from the conditions in (48), of all the coefficients and the number of bits required to normalize the results.

According to [14], multiplications for and can be implemented with a minimum number of additions resorting to the n-dimensional reduced adder graph (RAG-n) technique. The total number of adders required for all the coefficients is shown in Table 1.

The DCT block contains adders to implement the Hadamard block. Using Tables 1 and 2, the number of adders used to implement the DCT can be calculated. The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and .

The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The first stage of lifting steps ( rotation) requires adders to implement the lifting structure, and further adders are required to implement all the steps involving and . The square root block contains adders, and there are square root blocks. Therefore, adders are calculating square roots in parallel. The total number of adders required for Hadamard and lifting scheme is .

Figure 8 shows the experimental setup used to calculate the PSNR. The original frames are transformed using the proposed DCT. Then the transformed data is quantized to 13 bits. The quantized coefficients are then passed through the inverse quantization block and the inverse DCT. The PSNR is then calculated between the original frames and the reconstructed frames. The inverse quantization is performed using (51).

Table 3 shows the PSNR values for different sequences with different DCT sizes. The PSNR is calculated as shown in (52) and (53) where $\mathrm{MAX}$ is the maximum possible pixel value of the image, and $\mathrm{MSE}$, the mean square error, can be defined as where and are the input frame and the reconstructed frame, after inverse quantization and IDCT, respectively. Table 3 shows that the PSNR of the reconstructed frames is very close to 50 dB, which indicates that the approximated DCT introduces little distortion.
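The PSNR measurement can be sketched with the standard definition, $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The default `max_value=255` assumes 8-bit samples, which is an assumption rather than something stated in the text, and the function names are illustrative.

```python
import math


def mse(original, reconstructed):
    """Mean square error between two frames given as lists of rows."""
    total, count = 0.0, 0
    for row_o, row_r in zip(original, reconstructed):
        for a, b in zip(row_o, row_r):
            total += (a - b) ** 2
            count += 1
    return total / count


def psnr(original, reconstructed, max_value=255):
    """Peak signal-to-noise ratio in dB (max_value=255 assumes 8-bit samples)."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / err)


if __name__ == "__main__":
    frame = [[10, 20], [30, 40]]
    noisy = [[11, 21], [31, 41]]
    print(psnr(frame, noisy))
```

For 8-bit data, an error of one gray level on every pixel gives about 48.13 dB, so values close to 50 dB correspond to an average error below one gray level.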

Tables 4, 5, and 6 show the number of multiplications, additions, and shifts required to calculate the different sizes of DCT. The proposed architecture has no multiplications, since all the multiplications are implemented using shifts and adds. As can be observed, the number of additions required to compute the 32-point DCT with the proposed architecture is lower than for the original DCT implementation and the other proposed ones.

The netlist is written in VHDL. Synopsys Design Vision is used for synthesis. The code is synthesized with a 90 nm standard cell library at a clock frequency of 150 MHz. Table 7 shows the results of the synthesis.

The time required by the proposed architecture to completely process an $N \times N$ macroblock is where if and otherwise. Thus, the total time to process one pixel frame is where accounts for the chroma subsampling, for example, for 4:4:4, for 4:2:2, and for 4:2:0. So from (55), taking , , and we obtain ms and ms for and , respectively. As a consequence, in the worst case () the proposed architecture sustains up to 48 frames per second.

#### 6. Conclusion

In this work, a dynamic $N$-point DCT for HEVC is proposed. A partially folded architecture is adopted to maintain speed and to save area. The DCT supports 4, 8, 16, and 32 points. The simulation results show that the PSNR is very close to 50 dB, which is reasonably good. Multiplications are removed from the architecture by introducing the lifting scheme and approximating the coefficients.