Abstract

Performing optimal bit-allocation with 3D wavelet coding methods is difficult because energy is not conserved after applying the motion-compensated temporal filtering (MCTF) process and the spatial wavelet transform. The problem has been solved by extending the 2D wavelet coefficient weighting method to account for the complicated pixel connectivity resulting from the lifting-based MCTF process. However, the computation of the weighting factors demands much computing time and a vast amount of memory. In this paper, we propose a method, which exploits the sparsity of the matrices involved, to compute the subband weighting factors efficiently. The proposed method is employed for the 5–3 and the 9–7 temporal filters in the lifting-based MCTF process. Experiments on various video sequences show that the computing time and memory needed for the computation of the weighting factors are dramatically reduced by applying the proposed approach.

1. Introduction

Scalable video coding, which is developed to support multiple resolutions in a single stream, is an excellent solution to the compression task of video broadcasting over heterogeneous networks [1, 2]. Compared to H.264/SVC [3], which was recently developed on top of the prevailing conventional closed-loop codec H.264/AVC, the multiresolution property of the 3D wavelet representation based on motion-compensated temporal filtering (MCTF) is a more natural solution to the scalability issue in video coding [4–7]. However, to compete with the great success of scalable coding methods based on H.264, the MCTF-based 3D wavelet video codec must be constantly improved.

For both H.264/AVC and the wavelet-based encoder, the coding performance greatly depends on the technique used to solve the optimal bit-allocation problem. The discrete cosine transform (DCT), an orthogonal transform used in H.264/AVC, simplifies the search for the optimal bit-allocation because the variances of the distortions caused by quantization are equal in the pixel and frequency domains. However, the wavelet transforms used in video/image coding (e.g., the 5–3 and 9–7 filters) are usually not orthogonal, so the variances of the distortions in the pixel domain and the frequency domain are not the same. For image coding, the problem has been elegantly solved [8, 9] by assigning different weights to the subbands, which makes the reconstruction error variance equivalent to the quantization error variance in the frequency domain. The method was directly extended to 3D wavelet coding previously [10], but the results are not satisfactory because the energy difference between the pixel and frequency domains results not only from using a biorthogonal wavelet but also from the MCTF process, which imposes a different connectivity status (single-connected, multiple-connected, or unconnected) on each pixel during motion prediction. Thus, the direct extension of the methods in [8, 9] from 2D wavelet coefficients to 3D wavelet coefficients cannot solve the 3D wavelet optimal bit-allocation problem, because it does not take the status of pixel connectivity in the MCTF process into account.

To fix this problem, a method that takes account of both the biorthogonal property of the wavelet and the pixel connectivity resulting from motion compensation in the MCTF process was proposed to equalize the variances of the distortion in the pixel and frequency domains by assigning weights to the subbands, thereby solving the bit-allocation problem in 3D wavelet coding [11]. Although the results show that the method is effective, a lot of computing time and memory is needed to compute the weighting factors, which makes the method a heavy load on the encoding process. To make the computation of the weighting factors more efficient, we must develop an approach that decreases the time and memory used by the process. For this purpose, this paper proposes a special data structure and the corresponding operations to save time and memory during the computation of the subband weighting factors.

The remainder of this paper is organized as follows. In Section 2, we briefly review the MCTF-EZBC (embedded zero block coding) [12] wavelet coding scheme. In Section 3, we summarize how the spatial and temporal weighting factors are derived. In Section 4, we explain the proposed method used to save time and memory during the computation of the subband weighting factors. In Section 5, we report the experimental results that compare the time and memory used by the traditional method and our method. Finally, the conclusion is given in Section 6.

2. An MCTF-EZBC Wavelet Video Coding Scheme

In an MCTF-EZBC wavelet video coding scheme, the video frames are first decomposed into multiple wavelet subbands by spatial and temporal wavelet transforms; then quantization and entropy coding are sequentially applied to the wavelet subbands. According to the order of the spatial and temporal decompositions, wavelet coding schemes fall into two categories: 2D+T (spatial filtering first) and T+2D (temporal filtering first). However, no matter whether the 2D+T or the T+2D scheme is applied, the spatial and temporal filterings can be described independently.

The purpose of spatial filtering is to separate the low and high frequency coefficients of a video frame. The spatial filtering usually consists of multiple sequential 2D wavelet decompositions. In a 2D wavelet decomposition, the input signal, which is represented by an $N$ by $N$ two-dimensional matrix, is decomposed into four $N/2$ by $N/2$ two-dimensional matrices, denoted by $LL$, $HL$, $LH$, and $HH$. For these subbands, the first letter indicates whether the subband contains the low ($L$) or high ($H$) frequency coefficients after the horizontal 1D wavelet transform, and the second letter indicates whether the subband contains the low ($L$) or high ($H$) frequency coefficients after the vertical 1D wavelet transform. After the decomposition, subband $LL$ is taken as the input of the next-level spatial decomposition.

If we let $\mathcal{H}_{k0}$ and $\mathcal{H}_{k1}$ denote the analyzing matrices of the $k$-th spatial decomposition, the corresponding wavelet subbands can be computed as
$$
\begin{bmatrix} F_{k0} & F_{k1} \\ F_{k2} & F_{k3} \end{bmatrix}
=
\begin{bmatrix} \mathcal{H}_{k0} \\ \mathcal{H}_{k1} \end{bmatrix}
F_{(k-1)0}
\begin{bmatrix} \mathcal{H}_{k0}^{T} & \mathcal{H}_{k1}^{T} \end{bmatrix},
\tag{1}
$$
where $F_{k0}$, $F_{k1}$, $F_{k2}$, and $F_{k3}$ correspond to the $LL$, $HL$, $LH$, and $HH$ subbands, respectively. If another spatial decomposition is applied to the frame, the subband $F_{k0}$ is further decomposed by the $(k+1)$-th analyzing matrices $\mathcal{H}_{(k+1)0}$ and $\mathcal{H}_{(k+1)1}$. According to this definition, the subband $F_{00}$ is the frame before any spatial decomposition, and any wavelet subband after the spatial filtering can be indexed by $(xy)$, which indicates that the subband is the $y$-th subband after the $x$-th spatial decomposition. The example in Figure 1(a) shows the wavelet subbands with their indices after performing a spatial filtering consisting of two spatial decompositions.
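To make the matrix form of (1) concrete, the following sketch builds one-level analyzing matrices for the 5–3 (LeGall) filter and applies them to a small frame. The taps and the periodic boundary extension are our assumptions for illustration; the codec's actual boundary handling and normalization may differ.

```python
import numpy as np

def analysis_matrices_53(N):
    """Build one-level 5-3 analyzing matrices H0 (low-pass) and H1
    (high-pass), each of size N/2 x N, assuming periodic extension."""
    low = {-2: -1/8, -1: 1/4, 0: 3/4, 1: 1/4, 2: -1/8}   # centered at even sample 2i
    high = {-1: -1/2, 0: 1.0, 1: -1/2}                   # centered at odd sample 2i+1
    H0, H1 = np.zeros((N // 2, N)), np.zeros((N // 2, N))
    for i in range(N // 2):
        for k, c in low.items():
            H0[i, (2 * i + k) % N] += c
        for k, c in high.items():
            H1[i, (2 * i + 1 + k) % N] += c
    return H0, H1

# One 2D decomposition of a frame F, following (1):
N = 8
F = np.arange(N * N, dtype=float).reshape(N, N)
H0, H1 = analysis_matrices_53(N)
LL, HL = H0 @ F @ H0.T, H0 @ F @ H1.T   # F_k0, F_k1
LH, HH = H1 @ F @ H0.T, H1 @ F @ H1.T   # F_k2, F_k3
print(LL.shape, HL.shape, LH.shape, HH.shape)   # four N/2 x N/2 subbands
```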

To reconstruct $F_{(k-1)0}$ from the wavelet subbands $F_{k0}$, $F_{k1}$, $F_{k2}$, and $F_{k3}$, the synthesis matrices $\mathcal{G}_{k0}$ and $\mathcal{G}_{k1}$ are used in the synthesis procedure, which can be represented by matrix operations as
$$
F_{(k-1)0}
=
\begin{bmatrix} \mathcal{G}_{k0} & \mathcal{G}_{k1} \end{bmatrix}
\begin{bmatrix} F_{k0} & F_{k1} \\ F_{k2} & F_{k3} \end{bmatrix}
\begin{bmatrix} \mathcal{G}_{k0}^{T} \\ \mathcal{G}_{k1}^{T} \end{bmatrix}.
\tag{2}
$$

Compared to the spatial filtering, which takes one frame as its input, the temporal filtering takes multiple frames (T+2D) or the subbands with the same spatial indices (2D+T) as its input. The temporal filtering consists of multiple temporal decompositions. However, unlike the spatial filtering, which only needs several 2D wavelet transforms, the temporal filtering needs multiple 1D wavelet transforms together with motion estimation/compensation to specify the input coefficients of each wavelet transform. The temporal filtering is therefore more complicated than the spatial filtering.

Usually, a temporal decomposition is implemented by the lifting structure, which consists of a predicting stage and an updating stage. As depicted in Figure 2, the H-frames (denoted by $h$) are first obtained in the predicting stage; then the L-frames (denoted by $l$) are obtained in the updating stage. The lifting structure can be described by matrix operations in which a frame is represented by a one-dimensional matrix $f$. If each input frame of the temporal filtering has size $N$ by $N$ and can be represented by an $N$ by $N$ two-dimensional matrix $F$, the one-dimensional matrix $f$ has size $N^2$ and is mapped from $F$ by letting $f[iN+j]=F[i,j]$.

The motion vectors are computed before the predicting and updating stages. We use $P^{m}_{x,y}$, a two-dimensional $N^2$ by $N^2$ matrix, to denote the motion vectors obtained by predicting the $y$-th frame from the $x$-th frame in the $m$-th temporal decomposition. Then, the predicting stage of the $m$-th temporal decomposition can be written as
$$
h^{m}_{2i+1}=f^{m-1}_{2i+1}+\mathcal{H}_{m}[2i]\,P^{m}_{2i,2i+1}f^{m-1}_{2i}+\mathcal{H}_{m}[2i+2]\,P^{m}_{2i+2,2i+1}f^{m-1}_{2i+2},
\tag{3}
$$
where $\mathcal{H}_{m}$ is the corresponding coefficient of the wavelet filter. Since the index $m$ represents the $m$-th lifting stage, the term $f^{m-1}_{i}$ represents the $i$-th frame obtained after the $(m-1)$-th lifting stage. If the 5–3 wavelet transform is used, the value of $\mathcal{H}_{m}$ in the predicting stage is $-0.5$.

In the updating stage, the inverse motion vector matrix $U^{m}_{y,x}$, which has the same size as $P^{m}_{x,y}$ and can be calculated from the matrix $P^{m}_{x,y}$ [13], is used to compute the decomposed L-frame as
$$
l^{m}_{2i}=f^{m-1}_{2i}+\mathcal{H}_{m}[2i+1]\,U^{m}_{2i+1,2i}h^{m}_{2i+1}+\mathcal{H}_{m}[2i-1]\,U^{m}_{2i-1,2i}h^{m}_{2i-1},
\tag{4}
$$
where $\mathcal{H}_{m}$ is the corresponding coefficient of the wavelet filter. If the 5–3 wavelet transform is used, the value of $\mathcal{H}_{m}$ in the updating stage is 1.
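As a concrete illustration of (3) and (4) and their inverses (6) and (7) below, the sketch here runs one 5–3 lifting step on toy frames represented as length-$N^2$ vectors. The motion matrices are set to the identity (zero motion) purely to keep the example self-contained; real $P$ and $U$ come from motion estimation.

```python
import numpy as np

n = 4                                   # N^2 = 4 pixels per frame (toy size)
rng = np.random.default_rng(0)
f = [rng.random(n) for _ in range(5)]   # f_0 ... f_4
P = np.eye(n)                           # P^m_{x,y}: identity = zero motion
U = np.eye(n)                           # inverse motion matrix of P

H_pred, H_upd = -0.5, 1.0               # 5-3 coefficients from the text
G_upd, G_pred = -1.0, 0.5               # coefficients of the inverse stages

# Predicting stage (3) and updating stage (4) for the interior frames:
h1 = f[1] + H_pred * (P @ f[0]) + H_pred * (P @ f[2])
h3 = f[3] + H_pred * (P @ f[2]) + H_pred * (P @ f[4])
l2 = f[2] + H_upd * (U @ h3) + H_upd * (U @ h1)

# Inverse updating (6), then inverse predicting (7):
f2_rec = l2 + G_upd * (U @ h3) + G_upd * (U @ h1)
f1_rec = h1 + G_pred * (P @ f[0]) + G_pred * (P @ f2_rec)
assert np.allclose(f2_rec, f[2]) and np.allclose(f1_rec, f[1])
```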

To illustrate how the motion vectors are represented by a matrix, an example is given in Figure 3. The two matrices are constructed as follows:
$$
P_{2i,2i+1}=\begin{bmatrix}
0&1&0&0&0&0\\
0&1&0&0&0&0\\
0&0&1&0&0&0\\
0&0&0&0&1&0\\
0&0&0&0&0&1\\
0&0&0&0&0&1
\end{bmatrix},\qquad
U_{2i+1,2i}=\begin{bmatrix}
0&0&0&0&0&0\\
1&0&0&0&0&0\\
0&0&1&0&0&0\\
0&0&0&0&0&0\\
0&0&0&1&0&0\\
0&0&0&0&1&0
\end{bmatrix}.
\tag{5}
$$
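The sketch below shows how such a matrix can be filled from a motion field: row $x$ of $P_{2i,2i+1}$ has a single 1 in the column of the reference pixel that predicts pixel $x$, so a reference pixel used several times yields a column with several 1s (multiple-connected), while an unused one yields a zero column (unconnected). The dense per-pixel motion field `mv` is a hypothetical simplification; the codec's motion model is block-based.

```python
import numpy as np

def build_P(mv, N):
    """N^2 x N^2 prediction matrix from mv[(i, j)] = (dy, dx): pixel (i, j)
    of the odd frame is predicted from pixel (i+dy, j+dx) of the even frame."""
    P = np.zeros((N * N, N * N))
    for (i, j), (dy, dx) in mv.items():
        P[i * N + j, ((i + dy) % N) * N + ((j + dx) % N)] = 1.0
    return P

# 2x2 toy frame: pixels (0,0) and (0,1) are both predicted from (0,1), and
# (1,0) and (1,1) are both predicted from (1,0); the even-frame pixels (0,0)
# and (1,1) are therefore unconnected.
mv = {(0, 0): (0, 1), (0, 1): (0, 0), (1, 0): (0, 0), (1, 1): (0, -1)}
P = build_P(mv, 2)
print(P)
print("connectivity of each even-frame pixel:", P.sum(axis=0))  # [0. 2. 2. 0.]
```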

After the $m$-th temporal decomposition, if another temporal decomposition is needed, the L-frames generated by the current temporal decomposition are taken as its input. So, we let $f^{m}_{i}$, the $i$-th input frame of the $(m+1)$-th temporal decomposition, be $l^{m}_{2i}$, the $2i$-th output frame of the $m$-th temporal decomposition. Although the H-frames do not participate in the $(m+1)$-th temporal decomposition, we still assign indices to them by letting $f^{m}_{i+S}=h^{m}_{2i+1}$, in which $S$ is the number of input frames of the $(m+1)$-th temporal decomposition. So, any frame obtained by the temporal filtering can be indexed by $(mn)$, which indicates that it is the $n$-th subband after the $m$-th temporal decomposition.

Usually, the temporal decompositions are performed sequentially until the output has only one L-frame. Since a decomposed frame can be synthesized, only the frames that cannot be synthesized are necessary to reconstruct all the frames. Figure 1(b) shows an example, in which the size of the group of pictures (GOP) is 8, with the indices of the frames that cannot be synthesized from the decomposed frames.

To recover the original frames, synthesis is applied to the decomposed frames. The frame that was decomposed last is synthesized first, and vice versa. In the synthesis procedure, the inverse updating is performed first as
$$
f^{m-1}_{2i}=l^{m}_{2i}+\mathcal{G}_{m}[2i+1]\,U^{m}_{2i+1,2i}h^{m}_{2i+1}+\mathcal{G}_{m}[2i-1]\,U^{m}_{2i-1,2i}h^{m}_{2i-1},
\tag{6}
$$
where $\mathcal{G}_{m}$ is the coefficient of the wavelet transform used in the inverse updating stage. If the temporal 5–3 filter is used, the value of $\mathcal{G}_{m}$ in the inverse updating stage is $-1$. After the inverse updating stage, the inverse predicting stage is performed as
$$
f^{m-1}_{2i+1}=h^{m}_{2i+1}+\mathcal{G}_{m}[2i]\,P^{m}_{2i,2i+1}f^{m-1}_{2i}+\mathcal{G}_{m}[2i+2]\,P^{m}_{2i+2,2i+1}f^{m-1}_{2i+2}.
\tag{7}
$$
If the temporal 5–3 filter is used, the value of $\mathcal{G}_{m}$ in the inverse predicting stage is 0.5. For some wavelet filters, such as the 9–7 filter, a temporal decomposition needs a multiple-level lifting structure, which is easily obtained by cascading multiple one-level lifting structures.

After the spatial and temporal filterings, quantization and entropy coding are applied to the wavelet subbands. The coefficients of these subbands may be quantized by scalar or vector quantization; the quantized coefficients are then losslessly coded by the entropy coder. This method is common in still-image compression standards such as JPEG 2000 [14]. In the decoding process, the quantized coefficients are obtained by decoding the received bitstream, and the subbands are then rescaled according to the quantization step used in the encoder. The advantage of separating the quantization and the entropy coding is that the quality of the reconstructed video can be predicted from the quantization step.

The quantization and the entropy coding can also be combined in a bitplane coding method, such as the EZBC entropy coder [12]. In these methods, the rates allocated to the subbands are calculated first, and the entropy coder then encodes the subbands at these rates. The advantage of this scheme is that the rates of the subbands can be any nonnegative integers, and the performance of the bit-allocation can be improved accordingly.

Whether the quantization and entropy coding are combined or separated, the bit-allocation greatly affects the coding efficiency. However, because the energy in the pixel domain can be altered by the spatial wavelet transform, the temporal wavelet transform, and the motion estimation in the MCTF process, the bit-allocation is prevented from achieving the optimum. To preserve the energy relationship between the pixel and wavelet domains, we derive weighting factors for the decomposed subbands. We explain how to derive these weighting factors in the next section.

3. Subband Weighting

The weighting factor indicates how much a unit quantization power in a subband contributes to the overall distortion in the reconstructed GOP. That is, for the subband indexed by $(xy,mn)$, which means the subband is the spatial $y$-th subband after the $x$-th spatial decomposition and the temporal $n$-th subband after the $m$-th temporal decomposition (an example of this indexing is depicted in Figure 1), a weighting $\gamma_{(xy,mn)}$ is given to satisfy
$$
D=\sum_{(xy,mn)\in\mathbf{S}}\gamma_{(xy,mn)}\times D_{xy,mn},
\tag{8}
$$
where $D$ is the variance of the distortion in the pixel domain, $D_{xy,mn}$ is the variance of the subband's distortion in the wavelet domain, and $\mathbf{S}$ is the set of indices of the subbands that cannot be synthesized during the reconstruction. The weighting factor $\gamma_{(xy,mn)}$ can be computed by
$$
\gamma_{(xy,mn)}=\alpha_{(xy)}\times\beta_{(mn)},
\tag{9}
$$
where $\alpha_{(xy)}$ is the spatial weighting factor and $\beta_{(mn)}$ is the temporal weighting factor of the subband indexed by $(xy,mn)$.

The spatial weighting factors can be directly computed according to the error propagation model in [15]. However, to compute the temporal weighting factors, the effect of the motion compensation must also be considered, following the approach in [11]. In the following subsections, we explain how to derive the spatial and temporal weighting factors, respectively.

3.1. Spatial Subband Weighting

In this section, we review the method used to derive the weighting factors of the spatial wavelet transforms in [15]. According to (2), the reconstructed subband $F_{(k-1)0}$ can be written as
$$
F_{(k-1)0}=\mathcal{G}_{k0}F_{k0}\mathcal{G}^{T}_{k0}+\mathcal{G}_{k0}F_{k1}\mathcal{G}^{T}_{k1}+\mathcal{G}_{k1}F_{k2}\mathcal{G}^{T}_{k0}+\mathcal{G}_{k1}F_{k3}\mathcal{G}^{T}_{k1}.
\tag{10}
$$
Let $\Delta F$ denote the error matrix resulting from the quantization of an image $F$; according to (10), we have
$$
\Delta F_{(k-1)0}=\mathcal{G}_{k0}\Delta F_{k0}\mathcal{G}^{T}_{k0}+\mathcal{G}_{k0}\Delta F_{k1}\mathcal{G}^{T}_{k1}+\mathcal{G}_{k1}\Delta F_{k2}\mathcal{G}^{T}_{k0}+\mathcal{G}_{k1}\Delta F_{k3}\mathcal{G}^{T}_{k1}.
\tag{11}
$$
Using the property of Kronecker products,
$$
V=AUB^{T}\;\Longleftrightarrow\; v=(A\otimes B)\,u,
\tag{12}
$$
where $v$ and $u$ are column vectors constructed row-wise from the matrices $V$ and $U$, respectively. Applying this identity to (11), we have
$$
\Delta c_{(k-1)0}=(\mathcal{G}_{k0}\otimes\mathcal{G}_{k0})\Delta c_{k0}+(\mathcal{G}_{k0}\otimes\mathcal{G}_{k1})\Delta c_{k1}+(\mathcal{G}_{k1}\otimes\mathcal{G}_{k0})\Delta c_{k2}+(\mathcal{G}_{k1}\otimes\mathcal{G}_{k1})\Delta c_{k3},
\tag{13}
$$
where $\Delta c$ is the column vector constructed row-wise from the matrix $\Delta F$. The reconstruction mean square error (MSE) of an image $F$ is
$$
\sigma^{2}_{c}=\frac{1}{N^{2}}E\!\left[\Delta^{T}_{c}\Delta_{c}\right].
\tag{14}
$$
The equation can be solved under the high bit rate assumption. At a high bit rate, it is assumed that the quantization errors of the wavelet subbands are white and mutually uncorrelated [16]. So, the following identities for the vector representations of the errors in subbands $x$ and $y$ are obtained:
$$
E\!\left[\Delta^{T}_{x}\Delta_{x}\right]=E\!\left[\mathrm{tr}\!\left(\Delta_{x}\Delta^{T}_{x}\right)\right]=N^{2}\sigma^{2}_{x},\qquad
E\!\left[\Delta^{T}_{x}\Delta_{y}\right]=E\!\left[\mathrm{tr}\!\left(\Delta_{x}\Delta^{T}_{y}\right)\right]=0\quad\text{when }x\neq y,
\tag{15}
$$
where $\sigma^{2}_{x}$ is the MSE of any element in the vector $\Delta_{x}$. By substituting (13) into (14) and applying the identities in (15), we obtain
$$
\begin{aligned}
\sigma^{2}_{c_{(k-1)0}}
&=\frac{1}{N^{2}}E\!\left[\Delta^{T}_{c_{(k-1)0}}\Delta_{c_{(k-1)0}}\right]
=\frac{1}{N^{2}}E\!\left[\mathrm{tr}\!\left(\Delta_{c_{(k-1)0}}\Delta^{T}_{c_{(k-1)0}}\right)\right]\\
&=\frac{1}{N^{2}}\Big[\sigma^{2}_{c_{k0}}\,\mathrm{tr}\!\left((\mathcal{G}_{k0}\otimes\mathcal{G}_{k0})(\mathcal{G}_{k0}\otimes\mathcal{G}_{k0})^{T}\right)
+\sigma^{2}_{c_{k1}}\,\mathrm{tr}\!\left((\mathcal{G}_{k0}\otimes\mathcal{G}_{k1})(\mathcal{G}_{k0}\otimes\mathcal{G}_{k1})^{T}\right)\\
&\qquad+\sigma^{2}_{c_{k2}}\,\mathrm{tr}\!\left((\mathcal{G}_{k1}\otimes\mathcal{G}_{k0})(\mathcal{G}_{k1}\otimes\mathcal{G}_{k0})^{T}\right)
+\sigma^{2}_{c_{k3}}\,\mathrm{tr}\!\left((\mathcal{G}_{k1}\otimes\mathcal{G}_{k1})(\mathcal{G}_{k1}\otimes\mathcal{G}_{k1})^{T}\right)\Big]\\
&=\alpha_{(k0)}\sigma^{2}_{c_{k0}}+\alpha_{(k1)}\sigma^{2}_{c_{k1}}+\alpha_{(k2)}\sigma^{2}_{c_{k2}}+\alpha_{(k3)}\sigma^{2}_{c_{k3}}.
\end{aligned}
\tag{16}
$$

The above derivation shows that the MSE measured before the wavelet transform is the weighted sum of the subbands' MSEs after the wavelet transform. Note that the weighting factors are determined only by the filters. According to the discussion in the previous section, the subband indexed by $(k0)$ is further decomposed in the $(k+1)$-th spatial decomposition. So, the spatial weighting factors of a multilevel decomposition can be solved by recursive substitution. To summarize, the MSE of the original frame can be computed as
$$
\begin{aligned}
\sigma^{2}_{c_{00}}&=\alpha_{(10)}\sigma^{2}_{c_{10}}+\alpha_{(11)}\sigma^{2}_{c_{11}}+\alpha_{(12)}\sigma^{2}_{c_{12}}+\alpha_{(13)}\sigma^{2}_{c_{13}}\\
&=\alpha_{(10)}\left(\alpha_{(20)}\sigma^{2}_{c_{20}}+\alpha_{(21)}\sigma^{2}_{c_{21}}+\alpha_{(22)}\sigma^{2}_{c_{22}}+\alpha_{(23)}\sigma^{2}_{c_{23}}\right)+\alpha_{(11)}\sigma^{2}_{c_{11}}+\alpha_{(12)}\sigma^{2}_{c_{12}}+\alpha_{(13)}\sigma^{2}_{c_{13}}\\
&=\sum_{k=1}^{s}\sum_{i=1}^{3}\left(\prod_{j=1}^{k-1}\alpha_{(j0)}\right)\alpha_{(ki)}\sigma^{2}_{c_{ki}}+\left(\prod_{j=1}^{s}\alpha_{(j0)}\right)\sigma^{2}_{c_{s0}},
\end{aligned}
\tag{17}
$$
in which $s$ is the total number of spatial wavelet transforms. So, the multilevel spatial weighting factor $\hat{\alpha}_{(xy)}$ can be computed from the one-level spatial weighting factors $\alpha_{(xy)}$ as
$$
\hat{\alpha}_{(xy)}=
\begin{cases}
\left(\prod_{j=1}^{x-1}\alpha_{(j0)}\right)\alpha_{(xy)}, & x\in\{1,2,\ldots,s\},\; y\in\{1,2,3\},\\[6pt]
\prod_{j=1}^{s}\alpha_{(j0)}, & x=s,\; y=0.
\end{cases}
\tag{18}
$$
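The following sketch evaluates the one-level spatial weights of (16) numerically. By the Kronecker identity $\mathrm{tr}\big((A\otimes B)(A\otimes B)^{T}\big)=\mathrm{tr}(AA^{T})\,\mathrm{tr}(BB^{T})$, each $\alpha$ reduces to a product of two 1D traces. The 5–3 synthesis taps and the periodic extension are our assumptions for illustration.

```python
import numpy as np

def synthesis_matrix(taps, center, N):
    """N x N/2 synthesis matrix: column i holds `taps` centered at row 2i+center."""
    G = np.zeros((N, N // 2))
    off = len(taps) // 2
    for i in range(N // 2):
        for k, c in enumerate(taps):
            G[(2 * i + center + k - off) % N, i] += c
    return G

N = 16
G0 = synthesis_matrix([1/2, 1.0, 1/2], 0, N)                # low-pass synthesis
G1 = synthesis_matrix([-1/8, -1/4, 3/4, -1/4, -1/8], 1, N)  # high-pass synthesis

# alpha_(ky) = tr(Ga Ga^T) tr(Gb Gb^T) / N^2 for the pairing in (16):
t0, t1 = np.trace(G0 @ G0.T), np.trace(G1 @ G1.T)
alpha = {'k0': t0 * t0 / N**2, 'k1': t0 * t1 / N**2,
         'k2': t1 * t0 / N**2, 'k3': t1 * t1 / N**2}
print(alpha)   # one-level weights; the multilevel weights then follow from (18)
```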

3.2. Temporal Subband Weighting

We follow the scheme proposed in [11] to explain how to derive and compute the temporal subband weighting factors. The temporal 5–3 filter, which is commonly used in wavelet video coding, is used to explain the procedure. The derivation of the weighting factors for the temporal 9–7 filter, the other filter commonly used in wavelet video coding, is detailed in [11].

We first explain how to compute the weighting factor $\beta_{(mn)}$; the weighting factor $\beta$ is expected to satisfy
$$
\sum_{i=0}^{S-1}\sigma^{2}_{f^{m-1}_{i}}=\sum_{i=0}^{S-1}\beta_{(mi)}\sigma^{2}_{f^{m}_{i}},
\tag{19}
$$
where $S$ is the total number of input subbands of the $m$-th temporal decomposition.

If the temporal 5–3 filter is used in the MCTF process, the filter coefficients $\mathcal{H}_{m}$ used in the predicting and updating stages are $-0.5$ and 1, respectively. To simplify the notation, $\mathcal{H}_{m}P^{m}$ is denoted by $P^{m}$ and $\mathcal{H}_{m}U^{m}$ is denoted by $U^{m}$; then (3) and (4) can be written as
$$
h^{m}_{2i+1}=f^{m-1}_{2i+1}+P^{m}_{2i,2i+1}f^{m-1}_{2i}+P^{m}_{2i+2,2i+1}f^{m-1}_{2i+2},
\tag{20}
$$
$$
l^{m}_{2i}=f^{m-1}_{2i}+U^{m}_{2i+1,2i}h^{m}_{2i+1}+U^{m}_{2i-1,2i}h^{m}_{2i-1}.
\tag{21}
$$

Let $\Delta f$ represent the quantization error resulting from lossy source coding, so that $f+\Delta f$ denotes the reconstructed frame. From (21), we can obtain the reconstruction error of the $2i$-th frame as
$$
\Delta f^{m-1}_{2i}=\Delta l^{m}_{2i}-U^{m}_{2i-1,2i}\Delta h^{m}_{2i-1}-U^{m}_{2i+1,2i}\Delta h^{m}_{2i+1}.
\tag{22}
$$

Substituting (22) into (20) for $\Delta f^{m-1}_{2i}$ and $\Delta f^{m-1}_{2i+2}$, we obtain the reconstruction error of the $(2i+1)$-th frame:
$$
\begin{aligned}
\Delta f^{m-1}_{2i+1}&=\Delta h^{m}_{2i+1}-P^{m}_{2i,2i+1}\Delta f^{m-1}_{2i}-P^{m}_{2i+2,2i+1}\Delta f^{m-1}_{2i+2}\\
&=-P^{m}_{2i,2i+1}\Delta l^{m}_{2i}-P^{m}_{2i+2,2i+1}\Delta l^{m}_{2i+2}+P^{m}_{2i,2i+1}U^{m}_{2i-1,2i}\Delta h^{m}_{2i-1}\\
&\quad+\left(I+P^{m}_{2i,2i+1}U^{m}_{2i+1,2i}+P^{m}_{2i+2,2i+1}U^{m}_{2i+1,2i+2}\right)\Delta h^{m}_{2i+1}+P^{m}_{2i+2,2i+1}U^{m}_{2i+3,2i+2}\Delta h^{m}_{2i+3}.
\end{aligned}
\tag{23}
$$

Equations (22) and (23) represent a motion-dependent error propagation model for a one-level MCTF process using the 5–3 temporal wavelet filter. The reconstructed MSE of the $2i$-th frame can be derived, by (14), as
$$
\begin{aligned}
\sigma^{2}_{f^{m-1}_{2i}}&=\frac{1}{N^{2}}E\!\left[\Delta^{T}_{f^{m-1}_{2i}}\Delta_{f^{m-1}_{2i}}\right]\\
&=\frac{1}{N^{2}}E\Big[\mathrm{tr}\!\left(\Delta_{l^{m}_{2i}}\Delta^{T}_{l^{m}_{2i}}\right)
+\mathrm{tr}\!\left(U^{m}_{2i-1,2i}\Delta_{h^{m}_{2i-1}}\Delta^{T}_{h^{m}_{2i-1}}\big(U^{m}_{2i-1,2i}\big)^{T}\right)
+\mathrm{tr}\!\left(U^{m}_{2i+1,2i}\Delta_{h^{m}_{2i+1}}\Delta^{T}_{h^{m}_{2i+1}}\big(U^{m}_{2i+1,2i}\big)^{T}\right)\\
&\qquad-2\,\mathrm{tr}\!\left(U^{m}_{2i-1,2i}\Delta_{h^{m}_{2i-1}}\Delta^{T}_{l^{m}_{2i}}\right)
+2\,\mathrm{tr}\!\left(U^{m}_{2i+1,2i}\Delta_{h^{m}_{2i+1}}\Delta^{T}_{h^{m}_{2i-1}}\big(U^{m}_{2i-1,2i}\big)^{T}\right)
-2\,\mathrm{tr}\!\left(U^{m}_{2i+1,2i}\Delta_{h^{m}_{2i+1}}\Delta^{T}_{l^{m}_{2i}}\right)\Big].
\end{aligned}
\tag{24}
$$
By defining $\mathcal{T}(\mathcal{A})=\mathrm{tr}(\mathcal{A}\mathcal{A}^{T})$ and using the high bit rate assumption behind the identities in (15), the last three cross-terms are zero and (24) becomes
$$
\begin{aligned}
\sigma^{2}_{f^{m-1}_{2i}}&=\sigma^{2}_{l^{m}_{2i}}+\frac{1}{N^{2}}\mathcal{T}\!\left(U^{m}_{2i-1,2i}\right)\sigma^{2}_{h^{m}_{2i-1}}+\frac{1}{N^{2}}\mathcal{T}\!\left(U^{m}_{2i+1,2i}\right)\sigma^{2}_{h^{m}_{2i+1}}\\
&=\omega^{f^{m-1}_{2i}}_{l^{m}_{2i}}\sigma^{2}_{l^{m}_{2i}}+\omega^{f^{m-1}_{2i}}_{h^{m}_{2i-1}}\sigma^{2}_{h^{m}_{2i-1}}+\omega^{f^{m-1}_{2i}}_{h^{m}_{2i+1}}\sigma^{2}_{h^{m}_{2i+1}}.
\end{aligned}
\tag{25}
$$
Following the same derivation, the reconstruction error of the $(2i+1)$-th frame is
$$
\begin{aligned}
\sigma^{2}_{f^{m-1}_{2i+1}}&=\frac{1}{N^{2}}\Big[\mathcal{T}\!\left(P^{m}_{2i,2i+1}\right)\sigma^{2}_{l^{m}_{2i}}+\mathcal{T}\!\left(P^{m}_{2i+2,2i+1}\right)\sigma^{2}_{l^{m}_{2i+2}}+\mathcal{T}\!\left(P^{m}_{2i,2i+1}U^{m}_{2i-1,2i}\right)\sigma^{2}_{h^{m}_{2i-1}}\\
&\qquad+\mathcal{T}\!\left(I+P^{m}_{2i,2i+1}U^{m}_{2i+1,2i}+P^{m}_{2i+2,2i+1}U^{m}_{2i+1,2i+2}\right)\sigma^{2}_{h^{m}_{2i+1}}+\mathcal{T}\!\left(P^{m}_{2i+2,2i+1}U^{m}_{2i+3,2i+2}\right)\sigma^{2}_{h^{m}_{2i+3}}\Big]\\
&=\omega^{f^{m-1}_{2i+1}}_{l^{m}_{2i}}\sigma^{2}_{l^{m}_{2i}}+\omega^{f^{m-1}_{2i+1}}_{l^{m}_{2i+2}}\sigma^{2}_{l^{m}_{2i+2}}+\omega^{f^{m-1}_{2i+1}}_{h^{m}_{2i-1}}\sigma^{2}_{h^{m}_{2i-1}}+\omega^{f^{m-1}_{2i+1}}_{h^{m}_{2i+1}}\sigma^{2}_{h^{m}_{2i+1}}+\omega^{f^{m-1}_{2i+1}}_{h^{m}_{2i+3}}\sigma^{2}_{h^{m}_{2i+3}}.
\end{aligned}
\tag{26}
$$
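As a worked example, the sketch below evaluates the weights of (25) and (26) for the 6-pixel matrices of (5), with the 5–3 coefficients absorbed into $P$ and $U$ as in (20) and (21). Using the same $P$ and $U$ for both temporal neighbors is a simplifying assumption of this example; in general each neighbor has its own motion matrices.

```python
import numpy as np

T = lambda A: np.trace(A @ A.T)     # T(A) = tr(A A^T)

P = np.array([[0,1,0,0,0,0], [0,1,0,0,0,0], [0,0,1,0,0,0],
              [0,0,0,0,1,0], [0,0,0,0,0,1], [0,0,0,0,0,1]], float)
U = np.array([[0,0,0,0,0,0], [1,0,0,0,0,0], [0,0,1,0,0,0],
              [0,0,0,0,0,0], [0,0,0,1,0,0], [0,0,0,0,1,0]], float)
Pm, Um = -0.5 * P, 1.0 * U          # coefficients -0.5 and 1 absorbed
n = P.shape[0]                      # frame size N^2 = 6 in this example

# Weights of (25) for an even frame: sigma^2_f = w_l*sigma^2_l + w_h*sigma^2_h + ...
w_l, w_h = 1.0, T(Um) / n
# Middle weight of (26) for an odd frame:
w_mid = T(np.eye(n) + Pm @ Um + Pm @ Um) / n
print(w_l, w_h, w_mid)
```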

By applying derivations similar to (25) and (26) to each input frame, we can relate the reconstruction errors before and after the temporal decomposition by a linear relation:
$$
\begin{bmatrix}
\sigma^{2}_{f^{m-1}_{2i-2}}\\ \sigma^{2}_{f^{m-1}_{2i-1}}\\ \sigma^{2}_{f^{m-1}_{2i}}\\ \sigma^{2}_{f^{m-1}_{2i+1}}\\ \sigma^{2}_{f^{m-1}_{2i+2}}\\ \sigma^{2}_{f^{m-1}_{2i+3}}\\ \sigma^{2}_{f^{m-1}_{2i+4}}
\end{bmatrix}
=\mathcal{W}^{m-1}_{m}
\begin{bmatrix}
\sigma^{2}_{h^{m}_{2i-3}}\\ \sigma^{2}_{l^{m}_{2i-2}}\\ \sigma^{2}_{h^{m}_{2i-1}}\\ \sigma^{2}_{l^{m}_{2i}}\\ \sigma^{2}_{h^{m}_{2i+1}}\\ \sigma^{2}_{l^{m}_{2i+2}}\\ \sigma^{2}_{h^{m}_{2i+3}}
\end{bmatrix}
=\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}
\begin{bmatrix}
\sigma^{2}_{f^{m}_{2i-3}}\\ \sigma^{2}_{f^{m}_{2i-2}}\\ \sigma^{2}_{f^{m}_{2i-1}}\\ \sigma^{2}_{f^{m}_{2i}}\\ \sigma^{2}_{f^{m}_{2i+1}}\\ \sigma^{2}_{f^{m}_{2i+2}}\\ \sigma^{2}_{f^{m}_{2i+3}}
\end{bmatrix},
\tag{27}
$$

where
$$
\mathcal{W}^{m-1}_{m}=
\begin{bmatrix}
\omega^{f^{m-1}_{2i-2}}_{h^{m}_{2i-3}} & \omega^{f^{m-1}_{2i-2}}_{l^{m}_{2i-2}} & \omega^{f^{m-1}_{2i-2}}_{h^{m}_{2i-1}} & 0 & 0 & 0 & 0\\
\omega^{f^{m-1}_{2i-1}}_{h^{m}_{2i-3}} & \omega^{f^{m-1}_{2i-1}}_{l^{m}_{2i-2}} & \omega^{f^{m-1}_{2i-1}}_{h^{m}_{2i-1}} & \omega^{f^{m-1}_{2i-1}}_{l^{m}_{2i}} & \omega^{f^{m-1}_{2i-1}}_{h^{m}_{2i+1}} & 0 & 0\\
0 & 0 & \omega^{f^{m-1}_{2i}}_{h^{m}_{2i-1}} & \omega^{f^{m-1}_{2i}}_{l^{m}_{2i}} & \omega^{f^{m-1}_{2i}}_{h^{m}_{2i+1}} & 0 & 0\\
0 & 0 & \omega^{f^{m-1}_{2i+1}}_{h^{m}_{2i-1}} & \omega^{f^{m-1}_{2i+1}}_{l^{m}_{2i}} & \omega^{f^{m-1}_{2i+1}}_{h^{m}_{2i+1}} & \omega^{f^{m-1}_{2i+1}}_{l^{m}_{2i+2}} & \omega^{f^{m-1}_{2i+1}}_{h^{m}_{2i+3}}\\
0 & 0 & 0 & 0 & \omega^{f^{m-1}_{2i+2}}_{h^{m}_{2i+1}} & \omega^{f^{m-1}_{2i+2}}_{l^{m}_{2i+2}} & \omega^{f^{m-1}_{2i+2}}_{h^{m}_{2i+3}}\\
0 & 0 & 0 & 0 & \omega^{f^{m-1}_{2i+3}}_{h^{m}_{2i+1}} & \omega^{f^{m-1}_{2i+3}}_{l^{m}_{2i+2}} & \omega^{f^{m-1}_{2i+3}}_{h^{m}_{2i+3}}\\
0 & 0 & 0 & 0 & 0 & 0 & \omega^{f^{m-1}_{2i+4}}_{h^{m}_{2i+3}}
\end{bmatrix},
\tag{28}
$$
and $\mathcal{X}^{m-1}_{m}$ is a 0–1 matrix, with a single 1 in each row, that rearranges the variances of the subbands $f^{m}$ into the interleaved order $(h^{m}_{2i-3}, l^{m}_{2i-2}, h^{m}_{2i-1}, l^{m}_{2i}, h^{m}_{2i+1}, l^{m}_{2i+2}, h^{m}_{2i+3})$ expected by $\mathcal{W}^{m-1}_{m}$ (29). That is, the matrix $\mathcal{X}^{m-1}_{m}$ is used to adjust the positions of the subbands. To measure the consequence of quantizing a subband after the temporal decomposition, we should aggregate the errors it induces in all the subbands before the temporal decomposition. This is exactly the summation of the corresponding column of the matrix $\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}$ for that subband. For example, the weighting factor of the reconstruction error resulting from subband $f^{m}_{2i+1}$ is the summation of the values in the $(2i+1)$-th column of $\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}$ and is denoted by $\beta_{(m(2i+1))}$. The following equation shows how to compute the weighting factors $\beta$:
$$
\begin{bmatrix}
\beta_{(m0)}\\ \beta_{(m1)}\\ \beta_{(m2)}\\ \beta_{(m3)}\\ \beta_{(m4)}\\ \beta_{(m5)}\\ \beta_{(m6)}
\end{bmatrix}
=
\begin{bmatrix}
\sum_{q}\left[\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}\right]_{q,0}\\
\sum_{q}\left[\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}\right]_{q,1}\\
\sum_{q}\left[\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}\right]_{q,2}\\
\sum_{q}\left[\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}\right]_{q,3}\\
\sum_{q}\left[\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}\right]_{q,4}\\
\sum_{q}\left[\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}\right]_{q,5}\\
\sum_{q}\left[\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}\right]_{q,6}
\end{bmatrix}.
\tag{30}
$$

If the temporal filtering consists of $t$ temporal decompositions in total, the variances of the distortions of the frames in the pixel domain can be computed as
$$
\begin{bmatrix}
\sigma^{2}_{f^{0}_{0}}\\ \sigma^{2}_{f^{0}_{1}}\\ \sigma^{2}_{f^{0}_{2}}\\ \sigma^{2}_{f^{0}_{3}}\\ \sigma^{2}_{f^{0}_{4}}\\ \sigma^{2}_{f^{0}_{5}}\\ \sigma^{2}_{f^{0}_{6}}
\end{bmatrix}
=\left(\prod_{m=1}^{t}\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}\right)
\begin{bmatrix}
\sigma^{2}_{f^{t}_{0}}\\ \sigma^{2}_{f^{t}_{1}}\\ \sigma^{2}_{f^{t}_{2}}\\ \sigma^{2}_{f^{t}_{3}}\\ \sigma^{2}_{f^{t}_{4}}\\ \sigma^{2}_{f^{t}_{5}}\\ \sigma^{2}_{f^{t}_{6}}
\end{bmatrix}
=\left(\prod_{m=1}^{t}\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}\right)
\begin{bmatrix}
\sigma^{2}_{f_{(t0)}}\\ \sigma^{2}_{f_{(t1)}}\\ \sigma^{2}_{f_{((t-1)2)}}\\ \sigma^{2}_{f_{((t-1)3)}}\\ \sigma^{2}_{f_{((t-1)4)}}\\ \sigma^{2}_{f_{((t-1)5)}}\\ \sigma^{2}_{f_{((t-1)6)}}
\end{bmatrix}.
\tag{31}
$$
Please note that the index of a subband $f_{i}$ can be changed from $i$ to $(mn)$, which indicates that the subband is obtained as the $n$-th subband after the $m$-th temporal decomposition (see Figure 1(b)). So, the temporal weighting factor $\beta_{(mn)}$ used in (9) can be computed as
$$
\beta_{(mn)}=\sum_{q}\left[\prod_{m'=1}^{t}\mathcal{W}^{m'-1}_{m'}\mathcal{X}^{m'-1}_{m'}\right]_{q,n}.
\tag{32}
$$
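Computationally, (31) and (32) amount to accumulating the per-level products $\mathcal{W}^{m-1}_{m}\mathcal{X}^{m-1}_{m}$ and reading each $\beta$ off as a column sum, as in the sketch below. The `WX` matrices here are placeholders; in the codec they are built from the motion matrices as derived above.

```python
import numpy as np

def temporal_weights(WX_list):
    """beta_(mn) = sum_q [prod_m W X]_{q,n}, per (32)."""
    M = np.eye(WX_list[0].shape[0])
    for WX in WX_list:               # m = 1 ... t
        M = M @ WX
    return M.sum(axis=0)             # column sums

# e.g., two decomposition levels over 4 subbands (placeholder values only):
WX_list = [np.full((4, 4), 0.25), np.diag([1.0, 0.9, 1.1, 1.0])]
print(temporal_weights(WX_list))
```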

Since both the spatial and temporal weighting factors have been derived, the performance of the bit-allocation method can be improved accordingly. In our previous work [11], we compared three bit-allocation methods, which consider spatial weighting factors, spatial-temporal weighting factors without motion vectors, and spatial-temporal weighting factors with motion vectors, respectively. The results showed that the bit-allocation method considering spatial-temporal weighting factors with motion vectors greatly outperforms the other two methods. However, the computation of the best weighting method costs a lot of computing resources. A method that reduces the needed resources is therefore proposed in the next section.

4. Fast Computation of the Weighting Factors

As discussed in the previous section, (16) is used to compute the spatial weighting factors, and (25) and (26) are used to compute the temporal weighting factors. However, a lot of computing resources are needed to obtain these weighting factors. Let the size of an input frame be $N$ by $N$. In (16), (25), and (26), if we use matrices to represent $\mathcal{G}\otimes\mathcal{G}$, $P$, and $U$, the size of each matrix is $N^2$ by $N^2$. If $m$ bytes are used to store a coefficient, a matrix takes $N^2\times N^2\times m$ bytes of storage. For a CIF frame (of size $352\times288$, so $N^2=101376$) with $m=4$, the matrix representation needs 39204 megabytes for only one matrix. Thus, using the matrix representation to compute the weighting factors is extremely inefficient, since it needs to store many $N^2$ by $N^2$ matrices.

Another issue that handicaps the matrix representation is the computing time needed to perform matrix addition and multiplication. The conventional matrix operations are defined as
$$
(AB)[x,y]=\sum_{i=1}^{n}A[x,i]\,B[i,y],
\tag{33}
$$
$$
(A+B)[x,y]=A[x,y]+B[x,y].
\tag{34}
$$
With these operations, the time complexity of the matrix multiplications in computing the weighting factors is bounded by $O(N^6)$. For a CIF frame, one matrix multiplication costs approximately $10^{15}$ real-number multiplications. Fortunately, these matrices are usually very sparse; that is, only a few coefficients in these matrices are nonzero. By taking advantage of this property, we use an efficient data structure to store the data required to compute the weighting factors.

Thus, we propose to use the arraylist representation. An arraylist consists of a one-dimensional array of pointers to linked lists. Figure 4 is an example demonstrating the arraylist structure. In the proposed data structure, two arraylists, which record the positions and values of the nonzero coefficients, respectively, are used to represent a large sparse matrix. In Figure 4, the matrix $P$ is an abbreviation of the matrix $P_{2i,2i+1}$. The arraylists $P_{\text{row,pos}}$ and $P_{\text{row,val}}$ record the positions and values of the nonzero coefficients in the matrix $P$, respectively. Because the matrix $P$ has 6 rows, both arraylists have 6 elements. In the arraylist $P_{\text{row,pos}}$, the $x$-th linked list stores the locations of the nonzero elements in the $x$-th row of the matrix. The arraylist $P_{\text{row,val}}$ has a structure similar to that of $P_{\text{row,pos}}$, but instead of storing the locations of the nonzero elements, $P_{\text{row,val}}$ stores their values. Thus, if $P_{\text{row,pos}}[x,y]$ (the $y$-th element of the $x$-th linked list) records a column position $p$, we can deduce that the element of the matrix $P$ at row $x$ and column $p$ is a nonzero coefficient whose value is stored in $P_{\text{row,val}}[x,y]$.
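A minimal sketch of the two-arraylist representation is shown below, with Python lists standing in for the linked lists (the operations the text relies on, namely appending, in-order traversal, and deletion, carry over directly).

```python
class SparseMatrix:
    """Two-arraylist representation: row_pos[x] holds the column positions of
    the nonzero coefficients of row x, row_val[x] their matching values."""
    def __init__(self, n_rows):
        self.row_pos = [[] for _ in range(n_rows)]   # P_row,pos
        self.row_val = [[] for _ in range(n_rows)]   # P_row,val

    @classmethod
    def from_dense(cls, dense):
        m = cls(len(dense))
        for x, row in enumerate(dense):
            for y, v in enumerate(row):
                if v != 0:                           # store nonzero entries only
                    m.row_pos[x].append(y)
                    m.row_val[x].append(v)
        return m

# The matrix P of (5) in arraylist form:
P = SparseMatrix.from_dense([
    [0,1,0,0,0,0], [0,1,0,0,0,0], [0,0,1,0,0,0],
    [0,0,0,0,1,0], [0,0,0,0,0,1], [0,0,0,0,0,1]])
print(P.row_pos)   # [[1], [1], [2], [4], [5], [5]]
```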

If the matrix is sparse enough, the proposed arraylist data structure can greatly reduce the memory use, because only the locations and values of the nonzero coefficients are stored in the arraylists. Assume that there are $n$ nonzero coefficients in each row of an $N^2$ by $N^2$ matrix. To construct an arraylist, we need a one-dimensional array of $N^2$ elements and $N^2$ linked lists, one for each element of the array, where each linked list has $n$ elements. Because our structure uses two arraylists, it needs $2(n+1)N^2 m$ bytes, where $m$ is the number of bytes used to store a coefficient. Thus, our data structure reduces the size needed to represent an $N^2$ by $N^2$ matrix from $mN^4$ bytes to $2(n+1)N^2 m$ bytes. In many cases the proposed data structure reduces the memory needed to store a $101376$ by $101376$ matrix ($N^2=352\times288$) from 39204 megabytes to about 1 megabyte.

Having settled the matrix storage, we use the matrices in (5) to explain how to perform matrix addition and multiplication with the proposed data structure. To simplify the notation, $P$ and $U$ represent $P_{2i,2i+1}$ and $U_{2i+1,2i}$ in (5), respectively. In matrix addition, we first create two new arraylists $C_{\text{row,pos}}$ and $C_{\text{row,val}}$ to store the result of the addition, with
$$
C_{\text{row,pos}}=P_{\text{row,pos}},\qquad C_{\text{row,val}}=P_{\text{row,val}}.
\tag{35}
$$
For all $x\in\{1,2,3,\ldots,N^2\}$, we move the linked lists $U_{\text{row,pos}}[x]$ and $U_{\text{row,val}}[x]$ to the tails of the linked lists $C_{\text{row,pos}}[x]$ and $C_{\text{row,val}}[x]$, respectively. Then, all linked lists in $C_{\text{row,pos}}$ are examined to check whether any two elements record the same position. If two elements, denoted by $C_{\text{row,pos}}[x,m]$ and $C_{\text{row,pos}}[x,n]$ with $m<n$, record the same position, then the value $C_{\text{row,val}}[x,m]+C_{\text{row,val}}[x,n]$ is assigned to the element $C_{\text{row,val}}[x,m]$, and the element $C_{\text{row,val}}[x,n]$ is removed to keep the data structure consistent. Once no two elements of any linked list in $C_{\text{row,pos}}$ record the same position, the matrix addition with the proposed data structure is complete. The result, the matrix $P+U$, is stored in $C_{\text{row,pos}}$ and $C_{\text{row,val}}$. Figure 5 demonstrates the matrix addition using the proposed data structure.
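Building on the `SparseMatrix` sketch above, the addition procedure just described can be written as follows; concatenating the rows corresponds to splicing the linked lists, and the duplicate scan merges entries that record the same position.

```python
def sparse_add(A, B):
    """Sketch of the addition procedure: copy A's lists, splice B's lists
    onto their tails, then merge entries recording the same position."""
    C = SparseMatrix(len(A.row_pos))
    for x in range(len(A.row_pos)):
        pos = A.row_pos[x] + B.row_pos[x]            # (35) plus the splice
        val = A.row_val[x] + B.row_val[x]
        for p, v in zip(pos, val):
            if p in C.row_pos[x]:                    # same position found:
                C.row_val[x][C.row_pos[x].index(p)] += v   # add the values
            else:                                    # otherwise keep the node
                C.row_pos[x].append(p)
                C.row_val[x].append(v)
    return C

U = SparseMatrix.from_dense([
    [0,0,0,0,0,0], [1,0,0,0,0,0], [0,0,1,0,0,0],
    [0,0,0,0,0,0], [0,0,0,1,0,0], [0,0,0,0,1,0]])
S = sparse_add(P, U)                                 # P + U for the matrices of (5)
print(S.row_pos, S.row_val)
```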

To obtain the matrix product $PU$ with the proposed data structure, the coefficients $(PU)[x,y]$ are needed for all $x,y\in\{1,2,3,\ldots,N^2\}$. We create two empty arraylists $M_{\text{row,pos}}$ and $M_{\text{row,val}}$, each of which has $N^2$ linked lists, to store the result of the matrix multiplication. To compute $(PU)[x,y]$, we examine the linked lists $P_{\text{row,pos}}[x]$ and $U_{\text{column,pos}}[y]$ to find elements recording the same position, and we multiply their corresponding values in $P_{\text{row,val}}[x]$ and $U_{\text{column,val}}[y]$. The value of $(PU)[x,y]$ is the summation of the products obtained by this process. To store a nonzero value of $(PU)[x,y]$, the values $y$ and $(PU)[x,y]$ are appended to the tails of the $x$-th linked lists of $M_{\text{row,pos}}$ and $M_{\text{row,val}}$, respectively. The result of performing matrix multiplication with the proposed data structure is depicted in Figure 6.
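A matching sketch of the multiplication follows. The column lists of $U$ ($U_{\text{column,pos}}$ and $U_{\text{column,val}}$ in the text) are derived here by one transposition pass over the row lists.

```python
def sparse_mul(A, B, n_cols):
    """Sketch of the multiplication procedure: for each (x, y), match the
    positions in A's row x against B's column y and sum the products."""
    col_pos = [[] for _ in range(n_cols)]            # U_column,pos
    col_val = [[] for _ in range(n_cols)]            # U_column,val
    for x in range(len(B.row_pos)):
        for p, v in zip(B.row_pos[x], B.row_val[x]):
            col_pos[p].append(x)                     # B[x, p] is nonzero
            col_val[p].append(v)

    M = SparseMatrix(len(A.row_pos))
    for x in range(len(A.row_pos)):
        a = dict(zip(A.row_pos[x], A.row_val[x]))
        for y in range(n_cols):
            s = sum(a[i] * v for i, v in zip(col_pos[y], col_val[y]) if i in a)
            if s != 0:                               # append nonzero results only
                M.row_pos[x].append(y)
                M.row_val[x].append(s)
    return M

M = sparse_mul(P, U, 6)                              # PU for the matrices of (5)
print(M.row_pos, M.row_val)
```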

Assume that each row of the matrix has $n$ nonzero coefficients. Computing the coefficients of one row of the matrix addition requires a matching algorithm between two sets of $n$ elements, several ($<n$) real-number additions for the matched elements, and several ($<n$) moving and ($<n$) deleting operations on the linked lists. Compared with the time complexity $O(N^4)$ of the conventional matrix addition, the time complexity $O(nN^2)$ of our method is much lower if the matrix is sparse enough. Meanwhile, for the matrix multiplication $PU$ with the proposed data structure, the value of $(PU)[x,y]$ is needed for all $x,y\in\{1,2,3,\ldots,N^2\}$, and each $(PU)[x,y]$ can be computed by a matching algorithm between two sets of $n$ elements, several ($<n$) real-number multiplications, and a summation of these products. Compared with the time complexity $O(N^6)$ of the conventional matrix multiplication, the time complexity $O(nN^4)$ of the proposed method is much lower if the matrix is sparse enough.

In addition to matrix addition and multiplication, the operator $\mathcal{T}$ defined by $\mathcal{T}(\mathcal{A})=\mathrm{tr}(\mathcal{A}\mathcal{A}^{T})$ is also needed during the computation of the subband weighting factors. The value of $\mathcal{T}(\mathcal{A})$, where $\mathcal{A}$ is of size $N^2$ by $N^2$, can be computed by summing the squares of all the coefficients in the matrix $\mathcal{A}$. With the conventional matrix representation, computing $\mathcal{T}(\mathcal{A})$ needs $N^4$ real-number multiplications and about $N^4$ real-number additions; with the proposed data structure, it needs only $nN^2$ real-number multiplications and about $nN^2$ real-number additions, where $n$ is the average number of nonzero coefficients in a row of the matrix.
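With the arraylist representation, $\mathcal{T}$ touches only the stored coefficients, as in this short sketch:

```python
def T(A):
    """T(A) = tr(A A^T): the sum of squares of all stored (nonzero)
    coefficients, visiting about n*N^2 entries instead of N^4."""
    return sum(v * v for row in A.row_val for v in row)
```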

Although the arraylist representation is more complicated than the matrix representation, the proposed method reduces both the memory and the computing time when the matrices being operated on are quite sparse. Since most of the matrices used to compute the weighting factors are very sparse, the proposed arraylist representation is extremely suitable for the weighting factor computation.

5. Experiment Results

We now demonstrate the efficacy of applying the arraylist representation to compute the weighting factors. In the experiment, we compare the time and memory used by the matrix representation and the arraylist representation during the computation of the weighting factors. The conventional matrix representation uses a two-dimensional array to store a matrix and the operations in (33) and (34) to calculate the addition and multiplication of two matrices. The arraylist representation uses two arraylists to store a matrix, and the operations are performed as described in the previous section.

A T+2D video codec is used in the experiment. Because a GOP has 32 CIF frames, a five-level MCTF process using the 5–3 or the 9–7 temporal wavelet filter is applied to each GOP. Then, each frame of a video sequence is decomposed by a three-level spatial wavelet transform using either the 5–3 or the 9–7 wavelet filter. The motion vectors used in the temporal filtering are found by full-search motion estimation with integer-pixel accuracy on the dyadic wavelet coefficients. The block size is $4\times4$, and the search range in both the horizontal and vertical dimensions is $[-16,15]$.

After the MCTF process, the motion vectors and the wavelet filters used in the spatial and temporal filterings are used to compute the weighting factors. We demonstrate and compare the results of two different schemes. Scheme 1 computes the weighting factors with the conventional matrix representation, while Scheme 2 computes them with the proposed arraylist representation. The equipment used in the experiment is a workstation with two quad-core CPUs and 12 GB of memory in total. Both the 5–3 and the 9–7 temporal filters are tested on various quarter-QCIF ($88\times72$) sequences.

Figure 7 compares the time consumed by Scheme 1 and Scheme 2. The consumed time is measured as the difference between the time stamps at the beginning and at the end of computing the weighting factors. We observe that the time consumed by Scheme 2 is much less than that consumed by Scheme 1.

We also compare the amount of memory used by the two schemes. Figure 8 illustrates that the amount of memory used by Scheme 2 is much less than that used by Scheme 1. The amount of used memory is obtained by averaging the monitored size of the memory used to compute each weighting factor.

For both the time and the memory used to compute the weighting factors, the experimental results show that the proposed arraylist representation achieves great reductions compared with the conventional matrix representation.

6. Conclusion

In this paper, we proposed the arraylist representation for matrices, which reduces the time and memory consumed by the computation of the weighting factors. We employed the proposed method in a T+2D structure and showed that the time and memory needed to compute the weighting factors are greatly reduced compared to the conventional matrix representation. In future work, we will extend the method to deal with the motion vectors obtained from subpixel motion estimation, to support more wavelet encoders.