Abstract

A novel watermarking framework for scalable coded video that improves the robustness against quality scalable compression is presented in this paper. Unlike the conventional spatial-domain (t + 2D) watermarking scheme, where the motion compensated temporal filtering (MCTF) is performed on the spatial frame-wise video data to decompose the video, the proposed framework applies the MCTF in the wavelet domain (2D + t) to generate the coefficients for watermark embedding. Robustness performance against scalable content adaptation, such as Motion JPEG 2000, MC-EZBC, or H.264-SVC, is reviewed for various combinations of motion compensated 2D + t + 2D using the proposed framework. The MCTF is improved by modifying the update step to follow the motion trajectory in the hierarchical temporal decomposition, using direct motion vector fields in the update step and implied motion vectors in the prediction step. The results show smaller embedding distortion in terms of both peak signal to noise ratio and flickering metrics compared to frame-by-frame video watermarking, while the robustness against scalable compression is improved by using 2D + t over the conventional t + 2D domain video watermarking, particularly for blind watermarking schemes where the motion is estimated from the watermarked video.

1. Introduction

Several attempts have been made to extend image watermarking algorithms to video, applying them either on a frame-by-frame basis or on 3D decomposed video. The initial attempts at video watermarking used frame-by-frame embedding [1–4], due to its simplicity of implementation using image watermarking algorithms. Such watermarking algorithms consider embedding on selected frames located at fixed intervals to make them robust against frame dropping-based temporal adaptations of video. In this case, each frame is treated separately as an individual image; hence, any image watermarking algorithm can be adopted to achieve the intended robustness. However, frame-by-frame watermarking schemes often perform poorly in terms of flickering artefacts and robustness against various video processing attacks, including temporal desynchronization, video collusion, video compression attacks, and so forth. In order to address some of these issues, the temporal dimension of video is exploited using different transforms, such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT), or the discrete wavelet transform (DWT). These algorithms decompose the video by performing a spatial 2D transform on individual frames followed by a 1D transform in the temporal domain. Various transforms have been proposed for 3D decomposed watermarking schemes, such as 3D DFT domain [5], 3D DCT domain [6], and, more popularly, multiresolution 3D DWT domain watermarking [7, 8]. A multilevel 3D DWT is performed by recursively applying the above-mentioned procedure on the low-frequency spatiotemporal subband. Various watermarking methods similar to image watermarking are then applied to suitable subbands to balance the imperceptibility and robustness. 3D decomposition-based methods overcome issues such as temporal desynchronization, video format conversion, and video collusion. However, such naive subband decomposition-based embedding strategies, which do not consider the motion content of the sequence during watermark embedding, often result in unpleasant flickering visual artefacts. The amount of flickering in watermarked sequences varies according to the texture, colour, and motion characteristics of the video content as well as the watermark strength and the choice of frequency subband used for watermark embedding. At the same time, these schemes are also fragile to video compression attacks, which consider the motion trajectory during compression coding.

In order to address the issues stated above, we have extended image watermarking techniques to video, considering the motion and texture characteristics of the video sequence using wavelet-based motion compensated 2D + t + 2D filtering. The proposed approach evolves from the motion compensated temporal filtering- (MCTF-) based wavelet domain video decomposition concept. MCTF has been successfully used in wavelet-based scalable video coding research [9, 10]. The idea of MCTF originated from 3D subband wavelet decomposition, which is merely an extension of the spatial domain transform into the temporal domain [11]. However, 3D wavelet decomposition alone does not decouple the motion information; this is addressed by performing the temporal filtering along the motion trajectories. This MCTF-based video decomposition technique opens a new avenue in transform domain video watermarking. A few attempts have already been made to investigate the effect of motion in video watermarking by incorporating motion compensation into video watermarking algorithms [12–14]. In these investigations, the sequence is first temporally decomposed into Haar wavelet subbands using MCTF and then spatially decomposed using the 2D DCT transform, resulting in the decomposition scheme widely known as t + 2D.

In this paper, we aim to advance further along the line of MCTF-based wavelet coding and propose a video watermarking scheme that is robust against scalable content adaptation, such as Motion JPEG 2000, MC-EZBC, or H.264-SVC, while retaining the imperceptibility. The apparent problems of the direct use of MCTF and t + 2D decompositions in watermarking are threefold.

(1) In scalable video coding research, it is evident that videos with different texture and motion characteristics, and hence different spatial and temporal features, perform differently in the t + 2D domain [9] and its alternative, the 2D + t domain [15], where MCTF is performed on the 2D wavelet decomposed domain. Further, in 3D subband decomposition for video watermarking, MCTF is only required for the subbands where the watermarks are embedded. Hence, motion estimation and compensation on the full spatial dimension (the t + 2D case) add unnecessary complexity to the watermarking algorithm.

(2) Conventional MCTF is focused on achieving higher compression and thus gives more attention to the prediction lifting step in MCTF. For watermarking, however, it is necessary to follow the motion trajectory of the content into the low-frequency temporal subband frames, in order to avoid motion mismatch in the update step of MCTF when these frames are modified due to watermark embedding.

(3) The t + 2D structure offers better energy compaction in the low-frequency temporal subband, keeping most of the coefficient values very small or nearly zero in the high-frequency temporal subbands. This is very useful during compression but leaves very little room for watermark embedding in the high-frequency temporal subbands. Therefore, for a robust algorithm, most of the MCTF domain watermarking schemes mentioned before embed the watermark in the lowpass temporal frames. On the other hand, 2D + t retains more energy in the high-frequency subbands, which makes it possible to embed and recover the watermark robustly using the highpass temporal frames, improving the overall imperceptibility of the watermarked video.

To overcome these shortcomings, we propose an MCTF-based 3D wavelet decomposition for video sequences and offer a flexible, generalized 2D + t + 2D motion compensated temporal-spatial subband decomposition framework using a modified MCTF for video watermarking. Using the framework, we study and analyze the merits and demerits of watermark embedding using various combinations of the 2D + t + 2D structure and propose new 2D + t video watermarking algorithms to improve the robustness against quality scalable video compression.

The rest of the paper is organized as follows. In Section 2, the modified MCTF scheme is presented along with the new 2D + t + 2D subband decomposition framework. The video watermarking algorithms using the implementation of different subband decomposition schemes are proposed in Section 3. The analysis of the framework is described in Section 4. The experimental results are shown and discussed in Section 5 followed by the conclusions in Section 6.

2. Motion Compensated Spatiotemporal Filtering

The generalized spatiotemporal decomposition scheme consists of two modules: (1) MCTF and (2) 2D spatial frequency decomposition. To capture the motion information accurately, we have modified the commonly used lifting-based MCTF by tracking interframe pixel connectivity and use the 2D wavelet transform for spatial decomposition. In this section, first we describe the MCTF with implied motion estimation and then propose the 2D + t + 2D general framework.

2.1. MCTF with Implied Motion Estimation

We formulate the MCTF scheme with more focus on the motion trajectory-based update step as follows. Let $x_t$ be the video sequence, where $t$ is the time index in display order. We consider two consecutive frames, $x_{2t}$ and $x_{2t+1}$, as the current frame ($C$) and the reference frame ($R$), respectively, following the video coding terminology. In traditional motion estimation for lifting-based MCTF [9], frame $C$ is usually partitioned into nonoverlapping blocks and, for each block, motion is estimated from frame $R$ using a block matching algorithm. In this case, only two types of pixel connectivity are considered: (1) connected pixels and (2) unconnected pixels. If several pixels are connected to the same pixel in the reference frame, only one of them is categorized as a connected pixel. The temporal frames are derived using the subband analysis pair by replacing $x_{2t}$ with the low-frequency temporal frame ($L$) and $x_{2t+1}$ with the high-frequency temporal frame ($H$).

Connected pixels:
$$H[m,n] = \frac{1}{\sqrt{2}}\bigl(x_{2t+1}[m,n] - x_{2t}[m - d_v(m,n),\, n - d_h(m,n)]\bigr),$$
$$L[m,n] = \frac{1}{\sqrt{2}}\bigl(x_{2t}[m,n] + x_{2t+1}[m + d_v(m,n),\, n + d_h(m,n)]\bigr),$$
where $d_v$ and $d_h$ represent the motion vector fields, that is, the vertical and horizontal displacements of the nonoverlapping blocks, respectively.

Unconnected pixels:
$$L[m,n] = \sqrt{2}\, x_{2t}[m,n].$$
For the unconnected pixels in $x_{2t+1}$, the scaled displaced frame differences are substituted to form the temporal high subband:
$$H[m,n] = \frac{1}{\sqrt{2}}\bigl(x_{2t+1}[m,n] - x_{2t}[m - d_v(m,n),\, n - d_h(m,n)]\bigr).$$
As stated in the introduction, such a traditional scheme gives more attention to the prediction lifting step in MCTF, to reduce the prediction error in the high-frequency subband. This is useful in a compression scenario. However, in the case of watermarking, we account for the object motion within the low-frequency temporal frames to avoid motion mismatch in the update step when these frames are modified by watermark embedding. To address this, we use MCTF with implied motion estimation, which provides the opportunity to embed the watermark in any chosen low- or high-frequency temporal frame. At the same time, as opposed to the traditional scheme, we consider the relative contributions of one-to-many connected pixels, which is important for capturing the motion information accurately during the MCTF operation.
In the proposed scheme, the current frame $x_{2t}$ is partitioned into nonoverlapping blocks and, for each block, the vertical and horizontal displacements are quantified and represented as the motion vector fields $d_v$ and $d_h$, respectively. Figure 1 shows an example of how four nonoverlapping blocks in the current frame ($x_{2t}$) move in different directions in the next frame ($x_{2t+1}$). In the $x_{2t}$ frame, each block can be one of two types, namely, inter- and intrablocks, where the motion is only estimated for the former block type. Similarly, in the $x_{2t+1}$ frame, any pixel can be one of three types, namely, one-to-one connected (point A), one-to-many connected (points B and C), and unconnected (point D) (as shown in Figure 1), depending on its connectivity to the pixels in the $x_{2t}$ frame. The connectivity follows the implied motion vector fields $\bar{d}_v$ and $\bar{d}_h$, which are simply the directional inverse of the original motion vector fields $d_v$ and $d_h$.
Considering these block and pixel classifications, the lifting steps for pixels at position $[m,n]$ in frames $x_{2t}$ and $x_{2t+1}$ (i.e., $x_{2t}[m,n]$ and $x_{2t+1}[m,n]$) performing the temporal motion compensated Haar wavelet transform are defined as follows.

Forward Transform
The Prediction Step
For one-to-one connected pixels,
$$H[m,n] = x_{2t+1}[m,n] - x_{2t}[m + \bar{d}_v(m,n),\, n + \bar{d}_h(m,n)].$$
For one-to-many connected pixels,
$$H[m,n] = x_{2t+1}[m,n] - \frac{1}{N}\sum_{k=1}^{N} x_{2t}[m + \bar{d}_v^{\,k}(m,n),\, n + \bar{d}_h^{\,k}(m,n)],$$
where $N$ is the total number of connections.
For unconnected pixels,
$$H[m,n] = x_{2t+1}[m,n].$$
The last case is similar to the no-prediction case, as in the intrablocks used in conventional MCTF.

The Update Step
For interblocks, every pixel in an interblock is one-to-one connected with a unique pixel in $x_{2t+1}$. The update step is then computed as
$$L[m,n] = x_{2t}[m,n] + \frac{1}{2}\, H[m + d_v(m,n),\, n + d_h(m,n)].$$
For intrablocks, as there are no motion compensated connections with $x_{2t+1}$,
$$L[m,n] = x_{2t}[m,n].$$
Finally, these lifting steps are followed by the normalization step:
$$H[m,n] \leftarrow \frac{1}{\sqrt{2}}\, H[m,n], \qquad L[m,n] \leftarrow \sqrt{2}\, L[m,n].$$
The temporally decomposed frames $L$ and $H$ are the first-level low- and highpass frames and are denoted as the $L_1$ and $H_1$ temporal subbands. These steps are repeated for all frames in the sequence to obtain the $L_1$ and $H_1$ subbands and continued on the $L_1$ frames to obtain the desired number of temporal decomposition levels.
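To make the lifting steps concrete, the following is a minimal Python sketch of one forward level of the motion compensated Haar lifting for the simplified case in which every block is an interblock, all pixels are one-to-one connected, and the block displacements keep every block inside the frame (whole-block translation with no overlap). The function and variable names are illustrative and not from the original implementation.

```python
import numpy as np

def mctf_haar_forward(x0, x1, mv, mb=8):
    """One level of motion compensated Haar lifting (simplified case:
    all blocks are interblocks and all pixels are one-to-one connected).
    x0, x1 : consecutive frames x_{2t} (current) and x_{2t+1} (next)
    mv     : (rows//mb, cols//mb, 2) integer block displacements (dv, dh),
             estimated on x0 and pointing into x1 (the direct vectors)."""
    rows, cols = x0.shape
    H = np.zeros_like(x0, dtype=float)
    L = np.zeros_like(x0, dtype=float)
    for bm in range(0, rows, mb):
        for bn in range(0, cols, mb):
            dv, dh = mv[bm // mb, bn // mb]
            # Prediction step: H = x_{2t+1} - MC(x_{2t}) via the implied
            # (inverse) vectors; the block at (bm, bn) in x0 has moved to
            # (bm + dv, bn + dh) in x1.
            H[bm + dv:bm + dv + mb, bn + dh:bn + dh + mb] = (
                x1[bm + dv:bm + dv + mb, bn + dh:bn + dh + mb]
                - x0[bm:bm + mb, bn:bn + mb])
    for bm in range(0, rows, mb):
        for bn in range(0, cols, mb):
            dv, dh = mv[bm // mb, bn // mb]
            # Update step: L = x_{2t} + H/2, following the direct vectors.
            L[bm:bm + mb, bn:bn + mb] = (
                x0[bm:bm + mb, bn:bn + mb]
                + 0.5 * H[bm + dv:bm + dv + mb, bn + dh:bn + dh + mb])
    # Normalization step.
    return np.sqrt(2.0) * L, H / np.sqrt(2.0)
```

With an all-zero motion vector field, this reduces to the plain temporal Haar transform of 3D subband decomposition, which illustrates how the motion compensation is layered onto the lifting structure.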

Inverse Transform
For the inverse transform, the order of operation of the steps is reversed, as follows.
First, the decomposed coefficients are passed through an unnormalization step,
$$H[m,n] \leftarrow \sqrt{2}\, H[m,n], \qquad L[m,n] \leftarrow \frac{1}{\sqrt{2}}\, L[m,n],$$
followed by the inverse lifting steps.

The inverse update step: for interblocks,
$$x_{2t}[m,n] = L[m,n] - \frac{1}{2}\, H[m + d_v(m,n),\, n + d_h(m,n)],$$
and for intrablocks, $x_{2t}[m,n] = L[m,n]$.

The inverse prediction step: for one-to-one connected pixels,
$$x_{2t+1}[m,n] = H[m,n] + x_{2t}[m + \bar{d}_v(m,n),\, n + \bar{d}_h(m,n)];$$
for one-to-many connected pixels,
$$x_{2t+1}[m,n] = H[m,n] + \frac{1}{N}\sum_{k=1}^{N} x_{2t}[m + \bar{d}_v^{\,k}(m,n),\, n + \bar{d}_h^{\,k}(m,n)];$$
and for unconnected pixels, $x_{2t+1}[m,n] = H[m,n]$.
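Under the same simplifying assumptions, the inverse transform mirrors the forward sketch above; paired with it, the reconstruction of $x_{2t}$ and $x_{2t+1}$ is exact.

```python
import numpy as np

def mctf_haar_inverse(L, H, mv, mb=8):
    """Inverse of the simplified motion compensated Haar lifting sketch."""
    L = L / np.sqrt(2.0)   # unnormalization step
    H = H * np.sqrt(2.0)
    rows, cols = L.shape
    x0 = np.zeros_like(L)
    x1 = np.zeros_like(L)
    for bm in range(0, rows, mb):
        for bn in range(0, cols, mb):
            dv, dh = mv[bm // mb, bn // mb]
            # Inverse update step: x_{2t} = L - H/2 along the direct vectors.
            x0[bm:bm + mb, bn:bn + mb] = (
                L[bm:bm + mb, bn:bn + mb]
                - 0.5 * H[bm + dv:bm + dv + mb, bn + dh:bn + dh + mb])
    for bm in range(0, rows, mb):
        for bn in range(0, cols, mb):
            dv, dh = mv[bm // mb, bn // mb]
            # Inverse prediction step: x_{2t+1} = H + MC(x_{2t}).
            x1[bm + dv:bm + dv + mb, bn + dh:bn + dh + mb] = (
                H[bm + dv:bm + dv + mb, bn + dh:bn + dh + mb]
                + x0[bm:bm + mb, bn:bn + mb])
    return x0, x1
```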

2.2. 2D + t + 2D Framework

In a 3D video decomposition scheme, t + 2D is achieved by performing the temporal decomposition followed by a spatial transform, whereas in the case of 2D + t, the temporal filtering is done after the spatial 2D transform. Since each combination has its own merits and demerits, both need to be analyzed in order to enhance the video watermarking performance. A common, flexible, reconfigurable framework that allows creating such combinations is particularly useful for applications like video watermarking. Here, we propose the 2D + t + 2D framework by combining the modified motion compensated temporal filtering with the spatial 2D wavelet transform.

Let $(s_1, t, s_2)$ denote the numbers of decomposition levels used in the 2D + t + 2D subband decomposition to obtain a 3D subband decomposition with $t$ motion compensated temporal levels and $s$ spatial levels, where $s = s_1 + s_2$. In such a scheme, first the 2D DWT is applied for an $s_1$-level decomposition. As a result, a new sequence is formed by the low-frequency spatial subband of all frames. Then, this sequence of spatial subbands is temporally decomposed using the MCTF with implied motion estimation into $t$ temporal levels. Finally, each of the temporally transformed spatial subbands is further spatially decomposed into $s_2$ wavelet levels.

For an $(s_1 t s_2)$ motion compensated temporal subband decomposition, the values of $s_1$ and $s_2$ are determined by considering the choice of temporal-spatial subbands used for watermark embedding. From now onwards, in this paper, we will use the exact values of $(s_1 t s_2)$ to represent the various combinations of spatiotemporal decomposition. For example, the (032) and (230) parameter combinations result in the t + 2D and 2D + t motion compensated 3D subband decompositions, respectively. The same number of subband decomposition levels can also be obtained using the parameter combination (131) with the proposed generalized scheme implementation. The combination (002) yields a 2D decomposition of all frames for frame-by-frame watermark embedding. The realizations of these examples are shown in Figure 2. We use the notation ($LLL$, $LLH$, $LH$, $H$) to denote the temporal subbands after a 3-level temporal decomposition. The use of this framework in combination with the watermarking algorithms is described in the next section.
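The framework itself can be realized as a thin driver around any 2D DWT and the MCTF described in Section 2.1. The sketch below illustrates the $(s_1, t, s_2)$ parameterization using PyWavelets for the spatial transform; the mctf_forward callable is a placeholder for a multilevel wrapper around the MCTF with implied motion estimation and, like the other names here, is an assumption for illustration rather than the paper's implementation.

```python
import pywt  # PyWavelets; "bior4.4" is the 9/7 biorthogonal wavelet

def decompose_2d_t_2d(frames, s1, t_levels, s2, mctf_forward,
                      wavelet="bior4.4"):
    """Generalized (s1, t, s2) spatiotemporal decomposition of one GOP.
    frames       : list of 2D numpy arrays
    mctf_forward : callable(frames, levels) -> temporally decomposed frames."""
    # Step 1: s1-level 2D DWT on every frame; MCTF will run on the LL_{s1}
    # subbands only, so full-resolution motion estimation is avoided.
    spatial = [pywt.wavedec2(f, wavelet, level=s1) if s1 > 0 else [f]
               for f in frames]
    lowbands = [c[0] for c in spatial]  # LL_{s1} subband of each frame
    # Step 2: t-level MCTF on the sequence of low-frequency spatial subbands.
    temporal = mctf_forward(lowbands, t_levels) if t_levels > 0 else lowbands
    # Step 3: s2 further spatial DWT levels on each temporal subband frame.
    coeffs = [pywt.wavedec2(f, wavelet, level=s2) if s2 > 0 else f
              for f in temporal]
    return coeffs, spatial  # spatial keeps detail subbands for reconstruction

# Parameter combinations used in the paper:
#   (0, 3, 2) -> t + 2D      (2, 3, 0) -> 2D + t
#   (1, 3, 1) -> hybrid      (0, 0, 2) -> frame-by-frame
```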

3. Video Watermarking in 2D + t + 2D Spatiotemporal Decomposition

We propose a new video watermarking scheme by extending wavelet-based image watermarking algorithms into the 2D + t + 2D framework. In this section, we briefly revisit the wavelet-based image watermarking algorithms, followed by the proposed video watermarking scheme. We then analyze various combinations in the proposed video decomposition framework to decide the video-specific embedding parameters, such as (1) the choice of temporal subband and (2) the motion estimation parameters used to retrieve the motion information from the watermarked video.

3.1. Wavelet-Based Watermarking

Due to its ability to provide efficient multiresolution spatiofrequency representation of signals, the DWT has become the major transform for spread-spectrum image watermarking [16–22]. A broad classification of such wavelet-based watermarking algorithms can be found in [23]. In this paper, we have chosen commonly used example algorithms to represent the nonblind and blind watermarking algorithmic classes.

3.1.1. The Nonblind Case

A magnitude alteration-based additive watermarking algorithm is chosen for the nonblind case. In such an algorithm, coefficient values are increased or decreased depending on the magnitude of the coefficient, making the modified coefficient a function of the original coefficient:
$$C'_{s,t}[m,n] = C_{s,t}[m,n] + \alpha\, |C_{s,t}[m,n]|\, W[m,n],$$
where $C_{s,t}[m,n]$ is the original decomposed coefficient in the chosen spatiotemporal subband, $\alpha$ is the watermark weighting factor, $W[m,n]$ is the watermark value to be embedded, and $C'_{s,t}[m,n]$ is the corresponding modified coefficient.
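A minimal sketch of this embedding rule and the corresponding nonblind extraction, assuming the reconstructed equation above and a ±1-valued watermark; the threshold argument plays the role of the level adaptive thresholding of [22] mentioned in Section 5, and all names are illustrative.

```python
import numpy as np

def embed_nonblind(coeffs, watermark, alpha=0.1, threshold=1.0):
    """Magnitude alteration embedding: C' = C + alpha * |C| * W,
    skipping small or nearly zero coefficients."""
    mask = np.abs(coeffs) > threshold
    return coeffs + alpha * np.abs(coeffs) * watermark * mask

def extract_nonblind(coeffs_wm, coeffs_orig, threshold=1.0):
    """Nonblind extraction: sign of the difference against the original,
    valid only where embedding was not skipped (returned as a mask)."""
    diff = coeffs_wm - coeffs_orig
    return np.where(diff >= 0, 1, -1), np.abs(coeffs_orig) > threshold
```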

3.1.2. The Blind Case

In this category, we use an example blind watermarking algorithm as proposed in [20, 24], which modifies various coefficients towards a specific quantization step, $\Delta$. The method modifies the median coefficient within a nonoverlapping running window passed through the entire selected subband of the wavelet decomposed image. At each window position, a rank-order sorting is performed on the coefficients $C_1[m,n]$, $C_2[m,n]$, and $C_3[m,n]$ to obtain an ordered list $C_{(1)}[m,n] \le C_{(2)}[m,n] \le C_{(3)}[m,n]$. The median value $C_{(2)}[m,n]$ is modified to obtain $C'_{(2)}[m,n]$ as follows:
$$C'_{(2)}[m,n] = Q_{\Delta}\bigl(C_{(2)}[m,n],\, \alpha,\, b[m,n]\bigr),$$
where $b$ is the input watermark sequence, $\alpha$ is the weighting parameter, $Q_{\Delta}(\cdot)$ denotes a nonlinear quantization transformation, and $\Delta$ is the quantization step.
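The sketch below captures the spirit of the rank-order scheme: within each nonoverlapping three-coefficient window, the median is re-quantized so that the parity of its quantization index encodes one watermark bit. This plain parity-based quantizer is a stand-in for the nonlinear transformation of [20, 24], whose exact form is not reproduced here.

```python
import numpy as np

def embed_blind(subband, bits, delta=8.0):
    """Quantize the median of each nonoverlapping 1x3 window so that the
    parity of its quantization index matches the watermark bit."""
    out = subband.astype(float).copy()
    k = 0
    for r in range(out.shape[0]):
        for c in range(0, out.shape[1] - 2, 3):
            if k >= len(bits):
                return out
            w = out[r, c:c + 3]                  # view into out
            med = int(np.argsort(w)[1])          # index of the median
            q = np.round(w[med] / delta)
            if int(q) % 2 != bits[k]:            # adjust parity if needed
                q += 1.0 if w[med] >= q * delta else -1.0
            w[med] = q * delta
            k += 1
    return out

def extract_blind(subband, nbits, delta=8.0):
    """Blind extraction: read back the parity of the quantized median."""
    bits = []
    for r in range(subband.shape[0]):
        for c in range(0, subband.shape[1] - 2, 3):
            if len(bits) >= nbits:
                return bits
            w = subband[r, c:c + 3]
            med = int(np.argsort(w)[1])
            bits.append(int(np.round(w[med] / delta)) % 2)
    return bits
```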

3.2. The Proposed Video Watermarking Scheme

The new video watermarking scheme applies the above algorithms to the spatiotemporally decomposed video. The system block diagrams for watermark embedding, the nonblind extraction process, and the blind extraction process are shown in Figures 3, 4(a), and 4(b), respectively.

3.2.1. Embedding

To embed the watermark, the spatiotemporal decomposition is first performed on the host video sequence, by applying the spatial 2D DWT followed by the temporal MCTF for 2D + t (230), or the temporal decomposition followed by the spatial transform for t + 2D (032). In both cases, motion estimation (ME) is performed to create the motion vectors (MV), either in the spatial domain (t + 2D) or on the approximation subband in the frequency domain (2D + t), as described in Section 2.2. Other combinations, such as 131 and 002, are achieved in a similar fashion. After obtaining the decomposed coefficients, the watermark is embedded using either the nonblind or the blind watermarking algorithm of Section 3.1, by selecting various temporal low- or highpass frames (i.e., $LLL$ or $LLH$, etc.) and a spatial subband within the selected frame. Once embedded, the coefficients follow the inverse spatiotemporal decomposition in order to reconstruct the watermarked video.

3.2.2. Extraction and Authentication

The extraction procedure follows a similar decomposition scheme as in the embedding, and the corresponding system diagrams are shown in Figure 4. The watermark coefficients are retrieved by applying the 2D + t + 2D decomposition on the watermarked test video. For a nonblind algorithm, the original video sequence is available at the decoder and hence the motion vectors are obtained from the original video. After spatiotemporal filtering of the test and original videos, the coefficients are compared to extract the watermark. In the case of a blind watermarking scheme, the motion estimation is performed on the test video itself, without any prior knowledge of the original motion information. The temporal filtering is then done using the new motion vectors, and consequently the spatiotemporal coefficients are obtained for the detection.

The authentication is then done by measuring the Hamming distance $H(W, W')$ between the original and the extracted watermark:
$$H(W, W') = \frac{1}{N_w}\sum_{i=1}^{N_w} W_i \oplus W'_i,$$
where $W$ and $W'$ are the original and the extracted watermarks, respectively, $N_w$ is the length of the watermark sequence, and $\oplus$ represents the XOR operation between the respective bits.
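For reference, a direct transcription of this measure:

```python
import numpy as np

def hamming_distance(w_orig, w_ext):
    """Fraction of mismatched bits between the original and extracted
    watermarks: 0 means perfect detection, 0.5 is chance level."""
    w_orig, w_ext = np.asarray(w_orig), np.asarray(w_ext)
    return np.count_nonzero(w_orig != w_ext) / w_orig.size
```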

4. The Framework Analysis in Video Watermarking Context

Before presenting the experimental results, in this section we address the issues related to MCTF-based video watermarking within the proposed framework. First, to improve the imperceptibility, we investigate the energy distribution of the host video in the different temporal subbands, which is useful for selecting the temporally decomposed frames during embedding. Then, an insight is given into motion retrieval for a blind watermarking scheme, where no prior motion information is available during watermark extraction; this is crucial for the robustness performance.

4.1. On Improving Imperceptibility

In wavelet domain watermarking research, it is a well-known fact that embedding in high-frequency subbands offers better imperceptibility, while low-frequency embedding provides higher robustness. Wavelet decompositions often compact most of the energy into the low-frequency subbands and leave little energy in the high frequencies; for this reason, high-frequency watermarking schemes are less robust to compression. Therefore, increasing the energy retained in the high-frequency subbands can lead to a better watermarking algorithm.

In analyzing our framework, the research findings show that different 2D + t + 2D combinations vary the energy distribution in the high-frequency temporal subbands, and that this effect is independent of the video content. As an example, we decomposed the Foreman sequence using the 032, 131, and 230 combinations in the framework and calculated the sum of energy for the first two GOPs, each with 8 temporal frequency frames, namely, $LLL$, $LLH$, $LH_1$, $LH_2$, $H_1$, $H_2$, $H_3$, and $H_4$. In all cases, we calculate the energy for the low-frequency ($LL$) subband of the spatial decomposition. The other input parameters are a fixed macroblock size and a fixed search window, using fixed-size block matching motion estimation. The results for the percentage of energy (of a GOP) in each temporally decomposed frame are shown in Figure 5, and the histograms of the coefficients for the 032, 131, and 230 combinations of LLL and LLH are shown in Figure 6. The inner graph in Figure 6 is a zoomed version showing the local variations, with the $y$-axis clipped to display the coefficient distribution more clearly. From the results, we can rank the energy distribution in the high-frequency temporal subbands as $E(230) > E(131) > E(032)$. This analysis guides the selection of the optimum spatiotemporal parameters in the framework to improve the robustness while keeping better imperceptibility. We have performed the experimental simulation on 8 test videos (Foreman, Crew, News, Stefan, Mobile, City, Football, and Flower Garden) and all of them follow a similar trend.
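The energy shares reported in Figure 5 can be computed as follows, given the temporally decomposed frames of one GOP; a minimal sketch, assuming the frames are already available as arrays.

```python
import numpy as np

def gop_energy_share(temporal_frames):
    """Percentage of GOP energy carried by each temporal subband frame,
    e.g., the LL spatial subband of LLL, LLH, LH1, LH2, H1, ..., H4."""
    energies = np.array([float(np.sum(np.square(f, dtype=float)))
                         for f in temporal_frames])
    return 100.0 * energies / energies.sum()
```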

4.2. On Motion Retrieval

In an MCTF-based video watermarking scheme, the motion information largely determines the temporal decomposition along the motion trajectory. The modification introduced by watermark embedding in the temporal domain causes motion mismatch, which affects the decoder performance. While the original motion information is available for a nonblind watermarking scheme, motion estimation must be performed in the case of a blind video watermarking scheme. In this case, the motion vectors are expected to be retrieved from the watermarked video without any prior knowledge of the original motion vectors (MV). Our study shows that, in such a case, more accurate motion estimation is possible by choosing the right 2D + t + 2D combination along with an optimum choice of macroblock (MB) size. At the same time, we investigated the performance based on the motion search range (SR); the experimental performance shows that the SR effectively contributes less to motion retrieval. The experiment set is organized by studying the watermark detection performance, measuring the Hamming distance of a blind watermark embedded in the $LL$ spatial subband of the $LLL$ and $LLH$ temporal frames. The watermark extraction is done using various combinations of MB and SR to find the best motion retrieval parameters. The results are shown in Tables 1 and 2, using the average of the first 64 frames of the Foreman and Crew CIF size video sequences, respectively, for the 032, 131, and 230 spatiotemporal decompositions. Due to the limitations in macroblock size and integer pixel motion search, the smallest MB search is excluded for the 131 decomposition and the two smallest MB searches are excluded for the 230 decomposition. It is noted that, in video compression schemes, $16 \times 16$ is the most commonly used MB size, while in this paper we have used various other MB sizes to investigate the effect on watermark retrieval.

The results show that, for larger MB sizes, 2D + t outperforms t + 2D. In this context, the spatiotemporal decompositions can be ranked as $230 > 131 > 032$. In the case of 131 or 230, the motion is estimated in the hierarchically downsampled low-frequency subband. Therefore, the number of motion vectors reduces accordingly for a given macroblock size. This offers two-fold advantages, discussed below.

(1) Complexity
The search range during the motion estimation is either half or quarter of that of the full-resolution motion estimation. As a result, the searching time and computational complexity reduce significantly, as follows. Let us assume motion is estimated for an MB of size $MB \times MB$ with search range $SR$, as shown in Figure 7. The complexity, $O$, is calculated based on the number of search operations:
$$O = \frac{P}{MB^2}\,(2\,SR + 1)^2,$$
where $P$ is the total number of pixels. As motion is estimated only on the downsampled low-frequency component, we can rewrite this as
$$O = \frac{P}{4^{s_1}\, MB^2}\,(2\,SR + 1)^2,$$
where $s_1$ is the spatial decomposition level in the proposed scheme. Now, $SR$ is a constant for any given column in Tables 1 and 2, and hence it is evident that the complexity is inversely proportional to $4^{s_1}$:
$$O \propto \frac{1}{4^{s_1}}.$$
Therefore, the complexity of the various spatiotemporal decompositions can be ranked as $O(032) > O(131) > O(230)$; that is, the complexity of the proposed 2D + t scheme is much lower than that of the traditional t + 2D (a numerical illustration is given at the end of this section).

(2) MV Error Reduction
At the same time, for blind motion estimation, fewer motion vectors need to be estimated at the decoder, resulting in more accurate motion estimation and higher robustness. It is evident from Tables 1 and 2 that if the same number of motion vectors is considered, that is, an MB size halved with each additional spatial decomposition level across 032, 131, and 230, the robustness performance is comparable for all three combinations. However, in the $LL$ subband of 2D + t, for a smaller MB size, more motion mismatch is observed, as the motion estimation is done in a spatially decomposed region. Now, using the above analysis, we have designed experiments to verify our proposed video watermarking schemes for improved imperceptibility as well as robustness against scalable video compressions.
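The complexity ranking from item (1) is easy to illustrate numerically; the SR and MB values below are arbitrary examples for a CIF frame, not the values used in the experiments.

```python
# Search operations O = P / (4**s1 * MB**2) * (2*SR + 1)**2 for a CIF frame.
P, MB, SR = 352 * 288, 16, 16
for name, s1 in (("032 (t + 2D)", 0), ("131", 1), ("230 (2D + t)", 2)):
    ops = P / (4 ** s1 * MB ** 2) * (2 * SR + 1) ** 2
    print(f"{name}: {ops:,.0f} search operations")
# Ranking: O(032) > O(131) > O(230), a 4x reduction per spatial level s1.
```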

5. Experimental Results and Discussion

We used the following experimental setup for the simulation of watermark embedding using the proposed generalized 2D + t + 2D motion compensated temporal-spatial subband scheme. In order to keep the watermarking strength constant across subbands, the normalization steps in the MCTF and the 2D DWT were omitted. Two different sets of results were obtained to show the embedding distortion and the robustness performance, using the luma component of 8 CIF YUV test video sequences (Foreman, Crew, News, Stefan, Mobile, City, Football, and Flower Garden). However, within the scope of this paper, three test sequences are chosen to present the results according to their object motion activity, that is, high motion activity (Crew), medium motion activity (Foreman), and low motion activity (News). We have used one nonblind and one blind watermarking scheme as example cases, as described in Section 3.1. For the simulations shown in this work, the four combinations (032), (230), (131), and (002) were used. In each case, the watermark embedding is performed on the low-frequency subband ($LL$) of the 2D spatial decompositions, due to its improved robustness against compression attacks in image watermarking. In these simulations, the 9/7 biorthogonal wavelet transform was used for the 2D decompositions.

Based on the analysis in the previous section, here we explore the possibility of watermark embedding in the high-frequency temporal subband and investigate the robustness against compression attacks, as the high-frequency subband can offer improved imperceptibility. In the experiment sets, we chose the third temporal level highpass ($LLH$) and lowpass ($LLL$) frames to embed the watermark. The other video decomposition parameters are set to (1) 64 frames with a GOP size of 8, (2) a fixed macroblock size, and (3) a fixed search window, where the choices of macroblock size and search window follow the motion retrieval analysis in Section 4.2.

For the embedding distortion measure, we used the peak signal to noise ratio (PSNR) and also measured the amount of flicker introduced by watermark embedding. Fan et al. [25] defined a quality metric to measure flicker in intracoded video sequences. In our experiments, we measured flicker in a similar way, by calculating the difference between the average brightness values of the previous and current frames, using the flicker metric in the MSU quality measurement tool [26]. The flicker metric here compares the flicker content in the watermarked video with respect to the original video. In these metrics, a higher PSNR represents lower embedding distortion, and for flicker, lower values correspond to better distortion performance. On the other hand, the watermarking robustness is represented by the Hamming distance defined in Section 3.2.2, where a lower Hamming distance corresponds to better detection performance. Various scalable coded quality compression attacks are considered, such as Motion JPEG 2000, MC-EZBC scalable video coding, and the H.264/AVC scalable extension (H.264-SVC). In these experiments, the low-frequency spatial subbands are selected within the $LLL$ and $LLH$ temporal subbands; therefore, the scheme is robust against the respective spatial and temporal scalability. For example, the algorithm is robust against spatial scalability up to quarter resolution and temporal scaling up to the $LH$ and $H$ frames. The results show the mean value of the Hamming distance, averaged over the first 64 frames of the test video set.

The experiments are divided into two sets, one for embedding distortion analysis and the other for robustness evaluation. In both experimental setups, we considered two watermarking algorithms, one each from the nonblind (Section 3.1.1) and blind (Section 3.1.2) categories. The weighting parameters of both methods are set to 0.1. In the case of the nonblind algorithm, the level adaptive thresholding described in [22] is used to avoid watermark embedding in small or nearly zero coefficients, in order to minimize false detection. The watermarking payload is set to 2000 bits and 2112 bits (using a binary logo), for all combinations and every sequence, for the nonblind and blind watermarking methods, respectively.

5.1. Embedding Distortion Analysis

The embedding distortion results in terms of PSNR are shown in Figures 8, 10, and 12 for the News, Foreman, and Crew sequences, respectively, for the nonblind and blind watermarking methods. In each of these figures, the $x$-axis shows the frame number and the $y$-axis represents the PSNR. The flickering results are shown in Figures 9, 11, and 13 for the News, Foreman, and Crew sequences, respectively. In these figures, the $y$-axis represents the flicker metric as discussed in the previous section.

From the results for the $LLL$ subband, it is evident that although the PSNR performances are comparable, the proposed MCTF-based methods ((032), (131), and (230)) outperform the frame-by-frame embedding (002) in addressing the flickering problem. In all four combinations, the sums of energy in the $LLL$ subband are similar, resulting in comparable PSNR. However, in the proposed methods, the embedding error is propagated along the GOP due to the hierarchical temporal decomposition along the motion trajectory, and this error propagation along the motion trajectory addresses the issues related to flickering artifacts. On the other hand, for the $LLH$ subband, due to the temporal filtering, the sum of energy is smaller and the four combinations can be ranked as $E(002) > E(230) > E(131) > E(032)$. Hence, the PSNR and flickering performance for this temporal subband can be ranked in the reverse order, $032 > 131 > 230 > 002$. Therefore, when choosing a temporally filtered high-frequency subband, such as $H$, $LH$, or $LLH$, the proposed MCTF approach also outperforms the frame-by-frame embedding in terms of PSNR while addressing the flickering issues. It is evident that flickering due to frame-by-frame embedding is increasingly prominent in sequences with lower motion (e.g., News) and is successfully addressed by the proposed MCTF-based watermarking approach.

5.2. Robustness Performance Evaluation

The robustness results for the nonblind watermarking method are shown in Figures 14, 15, and 16 for the Crew, Foreman, and News sequences, respectively. The $x$-axis represents the compression ratio (Motion JPEG 2000) or bitrate (MC-EZBC and H.264-SVC) and the $y$-axis shows the corresponding Hamming distances. Columns (1) and (2) show the results for the $LLL$ and $LLH$ frame selections, respectively. The robustness performance shows that 2D + t, that is, any combination with temporal filtering performed on a spatial decomposition (i.e., (131) and (230)), outperforms a conventional t + 2D scheme. The experimental robustness results for the blind watermarking method are shown in Figures 17, 18, and 19 for the Crew, Foreman, and News sequences, respectively. Column 1 shows the results for the $LLL$ temporal subband while the results for $LLH$ are shown in Column 2. The rows represent the various scalability attacks: Motion JPEG 2000, MC-EZBC, and H.264-SVC, respectively. In this case, the motion information is obtained from the watermarked test video. As with the nonblind watermarking, 2D + t again outperforms a conventional t + 2D scheme such as that in [14]. We now analyze the obtained results by grouping them by selection of temporal subband, by embedding method, and by compression scheme.

5.2.1. Selection of Temporal Subband

The low-frequency temporal subband ($LLL$) offers higher robustness in comparison to the high-frequency subband ($LLH$). This is due to the higher energy concentration in the $LLL$ subband after temporal filtering. Within the temporal subbands, the various spatiotemporal combinations perform equally in the $LLL$ subband, as the energy levels are nearly equal for 032, 131, and 230; however, 230 performs slightly better due to less motion-related error in the spatially scaled subband. On the other hand, for the $LLH$ subband, we can rank the robustness performance as $230 > 131 > 032$, as a result of the energy distribution ranking of these combinations in Section 4.1.

5.2.2. Embedding Method

For the nonblind case, the watermark extraction is performed using the original host video and hence the original motion vectors are available at the extractor, which makes this scheme more robust to various scalable content adaptations. On the other hand, as explained before, the blind watermarking scheme has neither a reference to the original video sequence nor any reference motion vectors. The motion vectors are estimated from the watermarked test video itself, which results in comparatively poor robustness. The effect of motion-related error is more visible in the $LLH$ subband, as the motion compensated temporal highpass frames are highly sensitive to motion estimation accuracy, and so is the robustness performance. As discussed in Section 4.2, in the case of 2D + t (i.e., 230), the error due to the motion vectors is lower compared to the t + 2D scheme and hence offers better robustness.

5.2.3. Compression Scheme

We have evaluated our proposed algorithm against various scalable video compression schemes, that is, Motion JPEG 2000, MC-EZBC, and H.264-SVC. The first two compression schemes are based on wavelet technology, whereas the more recent H.264-SVC uses layered scalability with base layer coding of H.264/AVC.

In the Motion JPEG 2000 scheme, the coding is performed by applying the 2D wavelet transform on each frame separately, without considering any temporal correlation between frames. In the proposed watermarking scheme, the use of the 2D wavelet transform offers better association with the Motion JPEG 2000 scheme and hence provides better robustness for the 2D + t combination for both $LLL$ and $LLH$. Also, in the case of the $LLH$ subband, a better energy concentration offers higher robustness to Motion JPEG 2000 attacks. The robustness performance against Motion JPEG 2000 can be ranked as $230 > 131 > 032$.

The MC-EZBC video coder uses a motion compensated 1D wavelet transform for temporal filtering and the 2D wavelet transform for spatial decomposition. From a compression point of view, MC-EZBC usually encodes the video sequences in the t + 2D combination due to the better energy compaction in the low-frequency temporal frames. But from a watermarking perspective, higher energy in the high-frequency subband can offer higher robustness. This argument is supported by the robustness results, where the results for the $LLL$ subband are comparable but a distinctive improvement is observed in the $LLH$ subband; based on these results, the robustness ranking for MC-EZBC is $230 > 131 > 032$.

Finally, we have evaluated the robustness of the proposed scheme against H.264-SVC, which uses inter-/intramotion compensated prediction followed by an integer transform with properties similar to the DCT. Although the proposed watermarking and the H.264-SVC video coding scheme do not share any common technology or transform, the robustness evaluation against H.264-SVC has been carried out for the completeness of the paper across the different scalable video compression schemes. The results show acceptable robustness. However, for a blind watermarking scheme in the $LLH$ subband, the proposed schemes perform poorly due to blind motion estimation. Similar to the previous robustness results, based on the energy distribution and motion retrieval arguments, here we can rank the spatiotemporal combinations as $230 > 131 > 032$. As a specific example, H.264-SVC usually gives preference to intraprediction for sequences with low global or local motion, as in the News sequence, and hence an exception in the robustness performance against H.264-SVC is noticed for the proposed scheme.

It is evident that, due to the close association between the proposed scheme and MC-EZBC, the proposed scheme offers its best robustness against MC-EZBC-based content adaptation. To conclude this discussion, we suggest that the choice of a 2D + t watermarking scheme improves the imperceptibility and the robustness performance in a video watermarking scenario, for nonblind as well as blind watermarking algorithms.

6. Conclusions

In this paper, we have presented a new motion compensated temporal-spatial subband decomposition scheme, based on the MCTF with implied motion estimation, for video watermarking. The MCTF was modified by taking the motion trajectory into account to obtain an efficient update step. The proposed 2D + t domain watermarking offers improved robustness against scalable content adaptation compared to the conventional t + 2D video watermarking schemes, in both nonblind and blind watermarking scenarios. The robustness performance was evaluated against scalable coding-based quality compression attacks, including Motion JPEG 2000, MC-EZBC, and H.264-SVC (scalable extension). The proposed subband decomposition also provides low complexity, as the MCTF is performed only on the subbands where the watermark is embedded.

Acknowledgment

This work is funded by the UK Engineering and Physical Sciences Research Council (EPSRC) by an EPSRC-BP Dorothy Hodgkin Postgraduate Award (DHPA).