Abstract

This paper presents a steganalytic approach against video steganography that modifies motion vectors (MVs) in a content-adaptive manner. Current video steganalytic schemes extract features from fixed-length groups of frames across the whole video and do not take advantage of content diversity. Consequently, the effectiveness of the steganalytic features is influenced by the video content, and cover source mismatch further degrades steganalytic performance. The goal of this paper is to propose a steganalytic method that suppresses the differences in statistical characteristics caused by video content. The given video is segmented into subsequences according to the motion of blocks in every frame. The steganalytic features extracted from each category of subsequences with similar motion intensity are used to build one classifier. The final steganalytic result is obtained by fusing the results of the weighted classifiers. Experimental results demonstrate that our method effectively improves the performance of video steganalysis, especially for videos with low bitrates and low embedding ratios.

1. Introduction

Steganography, the art and science of data hiding, realizes covert communication under the camouflage of innocent-looking cover media. It does not arouse eavesdroppers’ suspicion because the perceptual and statistical characteristics of the embedded file are similar to those of its unaltered original. Facilitated by advanced video compression and computer network technology, digital video has become one of the most influential media, and the boom of highly interactive multimedia applications has created an urgent need to explore steganography that hides data in digital videos.

To date, video steganographic methods are usually integrated into the video compression process. Such approaches hide information by modifying certain output coefficients during compression, such as MVs [1–6], inter-prediction modes [7], quantized DCT coefficients [8–10], and variable length codes [11, 12]. In this paper, we focus on attacking MV-based steganography. Two advantages make MV-based steganography superior to the others. First, the rich motion information in compressed video streams guarantees sufficient embedding capacity. Second, the modifications applied to MVs do not affect the coding performance much.

Many MV-based steganographic methods have been proposed recently. Jordan et al. [1] embedded message bits by modifying the LSBs of the horizontal and vertical components of nonzero MVs. Xu et al. [2] suggested modifying the MVs whose magnitudes are above a given threshold. Aly [3] chose the candidate MVs according to their associated prediction errors. By applying mature coding techniques such as wet paper codes (WPCs) [13] and syndrome-trellis codes (STCs) [14, 15] to video steganography, adaptive steganographic schemes have been presented in recent years. Yao et al. [4] proposed an adaptive MV-based steganography that considers the changes in statistical distribution and in prediction error, using two-layered STCs [15] to minimize the embedding distortion. Several approaches were proposed to resist the steganalytic schemes based on the local optimality of MVs [16]. Cao et al. [5] exploited the opportunity to optimize the motion estimation (ME) perturbation using the loss caused by the video compression process. The data embedding was implemented using a double-layered coding structure (first channel: STCs [15]; second channel: WPCs [13]) with the distortion scale calculated on optimal neighbors. Wang et al. [6] embedded data based on a distortion defined by considering the motion characteristics of the video content, the local optimality of MVs, and the statistical distribution.

To reveal the existence of hidden messages, current video steganalysis divides the video into fixed-length detection intervals (DIs) and then extracts features from every DI. The calibration-based approach is a typical steganalytic method. In Wang et al.’s work [17], calibration-based steganalysis is further improved by matching the parameters between the first and second compression processes. Wang et al. [16] extracted features based on the difference between the actual sum of absolute differences (SAD) and the locally optimal SAD after an adding-or-subtracting-one operation on MVs. Recently, Zhang et al. [18] suggested checking the local optimality of MVs by considering both the distortion and the bit estimation associated with MVs, and utilized a near-perfect estimation of local optimality to detect MV-based steganography.

Although various steganalytic approaches have been presented, many challenges remain in the field of MV-targeted video steganalysis. Just as modifications are confined to textured regions in adaptive image steganography [19, 20], adaptive approaches [4–6] in video steganography can also constrain their embedding changes to the parts that are difficult to model, such as frames with rich motion. As a consequence, adaptive steganography has become the research focus owing to its high embedding capacity and enhanced security. The basic principle of current video steganalysis is to analyze the embedding perturbation and statistical changes within fixed-length DIs. However, the embedding changes are correlated not only with the steganographic method but also with the video content.

Consequently, the limitations of current video steganalysis can be summarized as follows. First, compared with the embedding process, the video content has a more significant impact on the differences in video statistical characteristics. Moreover, the detection accuracy of a steganalytic method relies on the performance of its classifiers. Therefore, if the training and testing videos differ considerably in motion intensity, the classification result will be noticeably affected.

To solve this problem, a steganalytic method using motion-based segmentation is proposed in this paper. The main contributions of this paper include segmenting whole videos into subsequences according to block motion and extracting steganalytic features from categories of subsequences with similar motion intensity; building a model of multiple subclassifiers; and fusing the results of the weighted subclassifiers to obtain the final steganalytic result.

The rest of the paper is organized as follows. Section 2 presents the motivation. In Section 3, our steganalytic approach, including video segmentation and decision fusion, is proposed. Section 4 shows the experimental results, and the conclusions and future work are given in Section 5.

2. Motivation

In current video steganography [4–6], the data is adaptively embedded by modifying the MVs according to the video content. Because the changes in statistical characteristics differ between frames with different motion intensity, the features extracted from fixed-length DIs are not effective and cannot cope well with adaptive steganography.

To demonstrate this phenomenon, we use Cao’s method [5] to embed information into the video Coastguard. The NPELO features [18] are extracted from every 12 frames and then fed to a trained classifier. Figure 1 shows the number of corrupted MVs and the corresponding decision value. It is observed that the effectiveness of the features is greatly affected by the motion intensity of the video content, and the features drawn from frames with high motion intensity are usually more effective than those from other frames.

In image steganalysis, many frameworks [21–24] that consider image content have been proposed, and they guide the design of our approach. If features are extracted from subsequences of different motion intensity, classifiers whose features differ in effectiveness can be trained independently. By assigning high weights to the effective classifiers, the overall steganalytic performance is expected to improve.

3. The Proposed Steganalysis

In this paper, the video is segmented into several subsequences, which are sorted by motion intensity. The features extracted from the categories with rich motion are expected to be more effective, which makes it easier to distinguish stego videos from cover ones.

The schematic diagram of the segmentation-based steganalysis is shown in Figure 2. In contrast to traditional video steganalysis, which directly extracts features from fixed-length DIs of the whole video, the input videos are first segmented into subsequences in both the training and testing processes. The features are then extracted from DIs within each category of subsequences with different motion intensity. In the training process, the cover and stego video pairs in the training set are segmented according to video content, and the features of the subsequences with similar motion intensity are used to train one classifier. Consequently, one classifier is trained per category. In the testing process, the features of the different categories of subsequences are fed to the corresponding classifiers. The final result is obtained by fusing the results of the classifiers, which are assigned different weights.

3.1. Video Segmentation

Inspired by the motion continuity of video content, we first segment the video into subsequences by linking blocks among adjacent frames. Subsequently, based on the characteristics of the linking, the subsequences are sorted into categories of different motion intensity.

3.1.1. MV Flow Based Segmentation

As an integral part of existing video coding standards, motion estimation (ME) is designed to reduce the temporal redundancy between video frames. This is achieved by allowing blocks of pixels from the currently coded frame to be matched with those from the reference frame(s). As a result of ME, an MV represents the spatial displacement between a block and its prediction.

Therefore, the values of MVs are largely determined by the ME performance, and MVs obtained by different ME methods vary considerably. It is accepted that a moving object is usually very different from the static background. Therefore, if MVs do not reflect the real motion, the corresponding prediction residuals are relatively large. Based on this principle, we define the “credible MV” as follows.

Definition 1 (credible MV). An MV is deemed credible if it satisfies the condition of Formula (1). Here, SAD is the sum of absolute differences between a macroblock (MB) and its prediction, and the formula also involves the standard deviation of SAD. The formula computes the difference between the current SAD and the average SAD of its neighbors.
If the difference does not exceed the preset threshold, the SADs vary smoothly in this area and there is no singularity at the center. That is, compared with its neighbors, the SAD of this macroblock is not distinctly large. Therefore, its MV can indicate the direction and magnitude of the real motion and is called a credible MV.
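
A minimal Python sketch of such a credibility test is given below. It assumes one plausible reading of the criterion: an MV is credible when the absolute difference between the SAD of its block and the mean SAD of the surrounding blocks does not exceed `k` times the standard deviation of SAD over the frame. The function name `credible_mask`, the 8-neighbourhood, the frame-wide scope of the standard deviation, and the scale factor `k` are illustrative assumptions, not values taken from the paper.

```python
from statistics import pstdev


def credible_mask(sad, k=1.0):
    """Mark the blocks whose MV is deemed credible.

    sad: 2-D list of per-macroblock SAD values (rows x cols).
    k:   hypothetical scale applied to the standard deviation (assumption).
    Returns a 2-D list of booleans with the same shape as `sad`.
    """
    rows, cols = len(sad), len(sad[0])
    # Standard deviation of SAD over the whole frame (assumed scope).
    sigma = pstdev([v for row in sad for v in row])
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Collect the SADs of the (up to) eight surrounding blocks.
            neighbours = [
                sad[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if not neighbours:
                # A single-block frame has nothing to compare against.
                mask[r][c] = True
                continue
            mean_nb = sum(neighbours) / len(neighbours)
            mask[r][c] = abs(sad[r][c] - mean_nb) <= k * sigma
    return mask
```

With this reading, a block whose SAD stands out sharply from its surroundings is rejected, while blocks in a uniform SAD field are all accepted.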
Inspired by the well-known concept of optical flow [25] in pixel-domain action recognition, we introduce the “MV Flow.” In the pixel domain, the dense optical flow of a frame is computed by tracking each pixel into the next frame, which yields a displacement with horizontal and vertical components. Despite the different implementation domain and details, this process is similar to ME in video compression. As the outcome of locating a macroblock in its reference frame, an MV shares some characteristics with the optical flow vector. Thus, in our method, the “MV Flow” is defined as follows.

Definition 2 (MV Flow, MVF). The MV Flow of a frame is defined as the collection of credible MVs over the macroblocks of that frame, where each credible MV is obtained by matching against MBs in the reference frame.

As shown in Figure 3, the MVF of every frame can be computed from its original credible MVs. In the first frame illustrated, the MVF equals the MV obtained by referring to the corresponding macroblock in the reference frame. In the next frame, the MVF follows from the operation rule of vectors (the Triangle Rule). Similarly, the MVF in the remaining frame is obtained by analyzing the constraints among the credible MVs.

In the MVF field, the shifts of macroblocks between two adjacent frames can be interpolated using credible MVs. As shown in Figure 4, given a compressed video sequence, the MVF of every frame can be obtained by the computation shown in Figure 3. The calculated MV points to the location of the corresponding macroblock in the adjacent frame. If the location crosses a macroblock boundary, the macroblock with the largest overlapping area is selected as the best match. To link associated blocks of the same size, a block of larger size is divided into subblocks. This manipulation relies on the motion continuity of video content. If there is a motion change or shot switch in the video, the whole video is segmented into several subsequences by linking the blocks.
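
The maximum-overlap rule described above can be sketched as follows. This is only an illustration under stated assumptions: blocks are 16x16 macroblocks on a regular grid, the frame is CIF-sized (352x288), and MVs are given in whole pixels. The function name `best_matching_block` and its parameters are hypothetical.

```python
def best_matching_block(x, y, mv, block=16, width=352, height=288):
    """Find the grid-aligned macroblock that overlaps most with a displaced block.

    (x, y): top-left pixel of the current macroblock.
    mv:     (dx, dy) displacement in whole pixels (sub-pixel MVs are not handled).
    Returns the (col, row) grid index of the best match, or None if the
    displaced block does not overlap any macroblock inside the frame.
    """
    px, py = x + mv[0], y + mv[1]
    best, best_area = None, 0
    # A displaced block can overlap at most four grid-aligned macroblocks.
    col0, row0 = px // block, py // block
    for row in (row0, row0 + 1):
        for col in (col0, col0 + 1):
            bx, by = col * block, row * block
            # Skip candidate macroblocks that fall outside the frame.
            if bx < 0 or by < 0 or bx + block > width or by + block > height:
                continue
            overlap_w = min(px + block, bx + block) - max(px, bx)
            overlap_h = min(py + block, by + block) - max(py, by)
            area = max(0, overlap_w) * max(0, overlap_h)
            if area > best_area:
                best, best_area = (col, row), area
    return best
```

For example, a block at pixel (16, 16) displaced by (5, 3) still overlaps mostly with grid block (1, 1), whereas a displacement of (12, 3) moves most of its area into grid block (2, 1). Ties are resolved in favour of the first candidate examined, which is a design choice of this sketch.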

3.1.2. Classification of Subsequences

In this subsection, the strategy for classifying subsequences is proposed. We first define a distance to measure the motion in a subsequence and then classify the subsequences into categories with different motion intensity.

As discussed in Section 3.1.1, similar blocks can be linked through the MVF. Derived from the distance measurement [26] in the pixel domain, we propose the “Linking Distance” to measure the motion of the block linking within a subsequence. The similar blocks are linked by vectors that lie in the image plane and are ordered along the discrete time axis, so a linking can be denoted by the sequence of these vectors. The “Linking Distance” of a subsequence is defined as follows.

Definition 3 (Linking Distance). The Linking Distance (LD) is defined for each subsequence of the video and takes into account the number of linkings in that subsequence. The motion intensity of a subsequence can be measured by calculating the LD over all of its block linkings. By setting several LD thresholds, the subsequences are classified into categories with different motion intensity, as formulated in Formula (4). The number of categories is determined by the set of thresholds and corresponds to the number of classifiers in the training and testing processes.
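
The two steps of this subsection, measuring the motion of a subsequence and assigning it to a category by thresholds, can be sketched in Python. The sketch assumes that the LD is the mean, over all block linkings of the subsequence, of the summed Euclidean lengths of the linking vectors; this is one plausible form, not necessarily the exact definition. The function names, the input layout, and the convention that a value equal to a threshold falls into the higher category are illustrative assumptions.

```python
from bisect import bisect_right
from math import hypot


def linking_distance(linkings):
    """Average path length of the block linkings in one subsequence.

    linkings: list of linkings; each linking is a list of (dx, dy) vectors,
              one per frame transition (assumed layout).
    """
    if not linkings:
        return 0.0
    # Path length of one linking = sum of the lengths of its vectors.
    lengths = [sum(hypot(dx, dy) for dx, dy in link) for link in linkings]
    return sum(lengths) / len(lengths)


def motion_category(ld, thresholds):
    """Map an LD value to a category index in 0..len(thresholds).

    thresholds: ascending list of LD thresholds; category 0 holds the
                subsequences with the lowest motion intensity.
    """
    return bisect_right(thresholds, ld)
```

With two thresholds the subsequences fall into three categories (low, middle, and high motion intensity), so the number of thresholds plus one equals the number of classifiers to be trained.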

As described in Figure 4, the segmentation of a compressed video is realized by linking similar blocks and classifying the subsequences. The implementation procedure is illustrated in Figure 5. First, when inspecting the MV field, credible MVs are selected to ensure that they represent the real motion of macroblocks. Based on the relevance of the credible MVs between frames, the MVF is obtained. Subsequently, under the maximum-overlap principle, similar blocks are linked between adjacent frames, which divides the whole video into several subsequences. The LD of every subsequence is then computed and used to assign the subsequence to one of the categories according to the preset thresholds. As a consequence, all of the subsequences are sorted into categories, and the features extracted from each category are fed to the corresponding classifier.

The proposed video segmentation method is based on the ME process in video compression. To assess the influence of intra-coded blocks, we tested several 720P sequences with QP set to 28. Only 5%-6% of the macroblocks were found to be intra-coded in inter-predicted frames. Thus the effect of intra-coded MBs is negligible in our method.

3.2. Decision Fusion

After the videos are segmented into subsequences sorted by motion intensity, the steganalytic features are extracted from each category of subsequences and used for training or testing. As shown in Figure 2, the features extracted from the different categories are used to train one classifier per category in the training process. To test a video, the features of each category are input to the corresponding trained classifier, and each classifier outputs a detection result.

Inspired by the fusing methods in image steganalysis [21–24], we assign a weight to each classifier. The weight of each classifier is defined in terms of its detection accuracy, which is computed from its true positive and true negative rates. If the detection accuracy for a specific category of subsequences is relatively high, more subsequences of that category are correctly classified. Thus a larger weight should be assigned to this classifier, and vice versa.

In the voting process, the decision value of a subsequence lies in the range from 0 to 1: it is 0 for a cover subsequence and 1 for a stego one. The final decision value is obtained by weighted voting over the classifiers. The input video is detected as a stego video if the final decision value meets the decision criterion and as a cover video otherwise. The final detection result of every single video is obtained by this fusion. The detection performance of the proposed method is evaluated by the detection accuracy of the weighted classifier set, which is the average of the TP and TN rates over the testing video set.
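
The weighting and voting steps can be illustrated with a short Python sketch. Several details are assumptions made for the example: each accuracy is taken as the mean of the TP and TN rates, the weights are normalised to sum to one, one decision value is supplied per category, and a video is labelled stego when the weighted score exceeds 0.5. The function names `classifier_weights` and `fuse` are hypothetical.

```python
def classifier_weights(tp_rates, tn_rates):
    """Weights proportional to each classifier's accuracy (assumed normalisation)."""
    # Accuracy of each classifier, taken as the mean of its TP and TN rates.
    acc = [(tp + tn) / 2 for tp, tn in zip(tp_rates, tn_rates)]
    total = sum(acc)
    return [a / total for a in acc]


def fuse(decisions, weights, threshold=0.5):
    """Weighted vote over per-category decisions (0 = cover, 1 = stego).

    A category with no subsequence in the video can be passed as None; it is
    skipped and the remaining weights are renormalised (assumption).
    Returns (score, is_stego).
    """
    pairs = [(d, w) for d, w in zip(decisions, weights) if d is not None]
    if not pairs:
        raise ValueError("no decisions to fuse")
    norm = sum(w for _, w in pairs)
    score = sum(d * w for d, w in pairs) / norm
    return score, score > threshold
```

For instance, with accuracies of 0.6, 0.8, and 0.9 for the low, middle, and high motion categories, a video flagged as stego only by the middle and high motion classifiers is still labelled stego, while a video flagged only by the low motion classifier is labelled cover.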

4. Experiments

4.1. Experimental Setup

Our proposed steganalytic scheme is implemented on x264 [27], a well-known H.264/AVC codec. The video database is composed of 100 standard 4:2:0 YUV sequences in CIF format. The raw sequences vary from 150 to 300 frames in length and are coded at a frame rate of 30 fps.

To evaluate our adaptive steganalytic strategy against existing MV-based steganography, Cao’s [5], Yao’s [4], and Wang’s [6] methods are implemented to generate the stego videos. The embedding ratio (ER) is measured by the corrupted MV ratio (CMVR), which is the ratio of the number of corrupted MVs to the total number of MVs in each frame. Bitrates (BR) of 0.5 Mbps, 1 Mbps, 3 Mbps, and 10 Mbps are considered, each with embedding ratios of CMVR = 0.1 and CMVR = 0.2.
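
As a small illustration of the CMVR definition, the ratio for one frame can be computed by comparing the MVs before and after embedding. The function name `cmvr` and the representation of MVs as `(dx, dy)` tuples in matching order are assumptions made for this sketch.

```python
def cmvr(original_mvs, stego_mvs):
    """Corrupted MV ratio of one frame: modified MVs divided by all MVs.

    original_mvs, stego_mvs: lists of (dx, dy) tuples in the same block order.
    """
    if len(original_mvs) != len(stego_mvs):
        raise ValueError("MV lists must have the same length")
    if not original_mvs:
        return 0.0
    # An MV counts as corrupted if either component was changed.
    changed = sum(1 for a, b in zip(original_mvs, stego_mvs) if a != b)
    return changed / len(original_mvs)
```

A frame with five MVs of which two were modified therefore has a CMVR of 0.4.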

In our experiments, the current best steganalytic features, NPELO [18] and MVRBR [17], are used to extract features from cover and stego samples. We randomly select 50 percent of the video pairs for training and use the remaining ones for testing. Each training and testing run is repeated several times, and the average detection accuracy is used to evaluate the final performance. Moreover, Chang and Lin’s support vector machine (SVM) [28] with a Gaussian kernel is used as the classifier.
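
The evaluation protocol described above (random split of the video pairs, repeated runs, averaged accuracy) can be sketched as follows. The sketch splits by video identifier so that the cover and stego versions of a video always land on the same side of the split. The function names, the number of repeats, and the seeds are illustrative assumptions, and `evaluate` is a caller-supplied placeholder standing in for feature extraction, SVM training, and testing.

```python
import random


def split_pairs(video_ids, train_fraction=0.5, seed=None):
    """Randomly split video IDs into training and testing halves.

    Each ID stands for a cover/stego pair, so a pair is never divided
    between the two sets.
    """
    rng = random.Random(seed)
    ids = list(video_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def average_accuracy(video_ids, evaluate, repeats=10, seed=0):
    """Average the accuracy of `evaluate(train_ids, test_ids)` over random splits.

    repeats: number of random splits (assumed value; the text only says
             "several times").
    """
    scores = []
    for i in range(repeats):
        train, test = split_pairs(video_ids, seed=seed + i)
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

Keeping both versions of a video in the same set avoids the situation in which the classifier has seen the cover version of a test video during training.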

4.2. Results and Discussion

First, to investigate the relationship between the number of categories and the steganalytic performance, we test the proposed scheme with different numbers of categories. The stego samples are generated by Cao’s, Yao’s, and Wang’s steganographic methods at a CMVR of 0.2, and the bitrate is set to 10 Mbps. Both the improved NPELO and the improved MVRBR are used for feature extraction. By assigning different values to the thresholds in Formula (4), the video subsequences can be segmented into several categories. The credible MV is defined by the threshold in Formula (1).

Table 1 records the detection accuracies with the corresponding threshold values. As illustrated in Figure 6, the steganalytic performance improves as the number of categories increases and stabilizes when the number is between 3 and 5. To reduce the computational and time complexity, the video subsequences are segmented into three categories of low, middle, and high motion intensity, with the parameters fixed accordingly, in the following experiments.

Because the statistical characteristics of MVs could be significantly influenced by the variations in coding parameters and embedding ratios, the steganalytic performances against Cao’s, Yao’s, and Wang’s steganography are further evaluated under the various configurations of bitrates and CMVRs.

The detection accuracies of the original NPELO and the proposed method against current MV-based steganographic algorithms are recorded in Table 2. Figure 7 compares their performance in detecting Cao’s, Yao’s, and Wang’s methods, respectively. The proposed approach performs better than the NPELO method for all three steganographic schemes. Our method achieves a detection accuracy of 99.72% when detecting Yao’s scheme at CMVR = 0.2 and 3 Mbps. The maximum detection accuracies against Cao’s and Wang’s methods are 98.34% and 97.77%, respectively.

The performances of the original and improved MVRBR are illustrated in Figure 8. The proposed approach effectively improves the accuracy in detecting the above three steganographic methods. From Table 3, we can observe that the maximum accuracies are achieved when the CMVR is 0.2 and the bitrate is 10 Mbps; they are 96.19%, 99.24%, and 94.73% against Cao’s, Yao’s, and Wang’s steganographic methods, respectively. The steganalytic performance of these methods improves as the bitrate increases, because fewer losses are induced in higher-quality videos and the features extracted from these subsequences are more effective.

The average improvements for NPELO and MVRBR are 2.02% and 6.36%, respectively. Consequently, our approach effectively improves the performance of steganalytic features, especially at low bitrates and low embedding ratios.

Moreover, to analyze the impact of the three categories on the overall result, the steganalytic performances for high, middle, and low motion intensity are tested separately, and the detection accuracies are recorded in Table 4. As shown in Figures 9 and 10, the high-motion category performs best among the three, whereas the low-motion category performs worst. Correspondingly, the weight of the third classifier is the largest, and a small weight is assigned to the first classifier. Therefore, the category of high motion intensity makes the greatest contribution to the improvement of steganalytic performance, followed by the category of middle motion intensity, and the category of low motion intensity contributes the least.

5. Conclusion and Future Works

In this paper, a segmentation-based steganalytic scheme aimed at MV-based video steganography is proposed. To reduce the differences in statistical characteristics caused by diverse video content, the input videos are segmented into subsequences according to block motion in each frame. Features are then extracted separately from every category of subsequences with similar motion intensity, and each category is used to train one classifier independently. In the testing process, the steganalytic features of each category are sent to the corresponding classifier, and the final decision is made through a weighted fusion process. The experimental results show that the proposed algorithm effectively improves the performance of video steganalysis, especially for low-bitrate videos with low embedding ratios.

In our future work, a larger database will be used to evaluate our method, and the videos will be classified into more categories with different motion intensity. Implementations on other video coding standards such as H.265/HEVC will also be considered.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the NSFC under U1636102 and U1536105 and National Key Technology R&D Program under 2014BAH41B01, 2016YFB0801003, and 2016QY15Z2500.