Abstract

To detect frame duplication in degraded videos, we propose a coarse-to-fine approach based on locality-sensitive hashing and image registration. The proposed method consists of a coarse matching stage and a duplication verification step. In the coarse matching stage, visually similar frame sequences are preclustered by locality-sensitive hashing and treated as potential duplication candidates. These candidates are then checked by a duplication verification step. Unlike existing methods, our duplication verification does not rely on a fixed distance (or correlation) threshold to judge whether two frames are identical. Instead, we resort to image registration, which is intrinsically a globally optimal matching process, to determine whether two frames coincide with each other. We integrate stability information into the registration objective function to make the registration process more robust for degraded videos. To test the performance of the proposed method, we created a dataset consisting of subsets with different kinds of degradation and 117 forged videos in total. The experimental results show that our method outperforms state-of-the-art methods in most cases in our dataset and exhibits outstanding robustness under different conditions. Thanks to the coarse-to-fine strategy, the running time of the proposed method is also quite competitive.

1. Introduction

With various nonlinear editing tools such as Adobe Premiere, Microsoft Movie Maker, and Sony Vegas, it is now much easier for people to tamper with the content of a video. Many different kinds of detection methods have been proposed [1–3]. Among the different approaches to video forgery, frame duplication, which simply copies a sequence of frames to another position in the timeline, may be one of the most convenient yet effective means to hide or counterfeit events. The forged part of the video can easily be made visually natural and is therefore difficult to detect manually. Fortunately, since the source and target frames simultaneously exist in the video, frame duplication forgery can be exposed by detecting abnormal identical frame sequences. On this basis, several methods have been proposed [4–8]. These methods share a common methodology to judge whether a frame is the copy of another: they extract features from the frames and set a distance threshold on those features. Such a methodology makes it difficult for these methods to perform robustly in realistic frame duplication detection (FDD), where degradation is quite common. In degraded videos, the local structure of the frames can be slightly altered by various factors, and the values of the extracted features change correspondingly; therefore, a fixed distance threshold cannot always work well. For instance, an experienced attacker may add a little perturbation (e.g., additive noise) to either the source or the target frames; in this noisy scenario, it is quite probable that a threshold tuned for ordinary cases will miss some true matching frame pairs. In fact, even the lossy encoding process will result in substantial differences between the source and target frames; see Figure 1 for an example. Since a tampered video is subject to at least two compressions, this example indicates that degradation in video forgeries is almost unavoidable, which makes it quite complicated to stably detect frame duplication in realistic scenarios.

In this paper, we attempt quite a different methodology for FDD. In the proposed method, we no longer rely on a fixed threshold to decide whether a video subsequence is the duplication of another. The key idea is that, for any two frames, if they contain the same objects and the corresponding objects' shapes and positions are identical, the two frames can be considered copies of each other. We resort to image registration to check whether these three aspects (objects, objects' shape, and objects' position) of two frames coincide with their counterparts. Specifically, the problem is solved by pixel-level globally optimal matching. When a given frame is aligned to its copy, there should not be any distortion in the resulting offset field, i.e., each pixel in the source frame matches the same location in the target frame. Our method is robust to a certain magnitude of video degradation, but the global matching procedure is not as fast as some feature extraction and comparison-based methods such as [4]. For acceleration, our pipeline involves a coarse matching step, which significantly improves the computational speed.

It should be noted that, like the other methods in this field [4–8], our method does not take static scenes into consideration. Therefore, our method can be used to expose counterfeit events. For instance, an attacker can copy a sequence of frames from a historical record to counterfeit the event of a man passing through a scene; the duplicated frames can then be used as evidence of being present or absent and therefore cast suspicion on that man or absolve him of guilt.

The contributions of this paper are listed as follows.

First, we propose a coarse-to-fine FDD scheme, whose first key step is preclustering perceptually similar sequences of frames by locality-sensitive hashing (LSH). Through this coarse matching step, the computation load for finer duplication verification can be reduced by several orders of magnitude.

Second, we use globally optimal matching for finer duplication verification, which is made affordable by the computational savings of the coarse matching step. We integrate the stability information of different regions into the matching objective, resulting in a matcher that is both robust and sensitive in noisy environments.

The rest of this paper is organized as follows: in section “Related Work,” we briefly introduce related work, and then in section “Proposed Method,” the proposed method is detailed. The experimental results are presented in section “Experimental Results.” Conclusions and future directions are finally drawn in section “Conclusion and Future Work.”

2. Related Work

To the best of our knowledge, there are only a few works on FDD. In [8], Wang and Farid proposed the first FDD method. The video is divided into overlapping short subsequences; then, for each subsequence, the temporal correlation between each pair of frames and the block-wise spatial correlation within each frame are calculated and used as features for subsequence comparison. The subsequences are compared with each other in a coarse-to-fine manner: high similarity in temporal correlation coefficients triggers the comparison between spatial correlation coefficients [9]. Given $n$ frames, the time complexity of the subsequence comparison process is $O(n^2)$. Methods using the same framework include [4, 6], which, respectively, use structural similarity (SSIM) [6] and histogram correlation to measure the similarity between subsequences.

Differing from [8], the authors of [5, 7] lexicographically sort the features corresponding to each frame (Tamura texture and local binary patterns are used as features in [5] and [7], respectively); neighbouring features that are close enough to each other in the feature space then correspond to duplicated frames. In this manner, the time cost of identifying matching frames is theoretically reduced to $O(dn\log n)$, where $d$ is the feature dimension and $n$ is the number of frames. It should be noted that, for lexicographical sorting, the features corresponding to the frames of the entire video have to be stored in memory simultaneously. From an implementation perspective, in memory-constrained environments, as $n$ and $d$ increase, the storage needed by lexicographical sorting can exceed the memory capacity. In such cases, the features have to be stored on disk instead, and the sorting procedure then involves frequent disk access, which is rather slow.

Taking the characteristics of the encoding process into consideration, Subramanyam and Emmanuel [10] extracted histogram of oriented gradients (HOG) features from block pairs colocated in neighbouring I, B and P, B frame pairs; high correlation between the HOG features then discloses duplicated blocks and hence frames. However, this method can only detect the case in which the source and target frames are placed next to each other, which is quite uncommon.

The methods discussed above all detect the duplication behaviour with a fixed global threshold, which makes them less robust. Although some features can be, to some extent, robust to degradation, a threshold calibrated for a certain condition may not be suitable for others. This problem becomes particularly noticeable when processing degraded videos, since there are many unstable factors caused by, e.g., compression artifacts or manually added noise.

One subject closely associated with FDD is near-duplicate identification (NDI) [11–14], whose major concern is copyright issues. Unlike FDD, in NDI the query video clip is known, while the goal of FDD is to find all duplicated frame pairs within a video sequence in which every frame can possibly be forged; therefore, FDD is more challenging in terms of time complexity, which increases quadratically with the number of frames [11]. Another major difference between FDD and NDI is that, in NDI, the potential attacks can be much stronger than in FDD: the pirated videos can be geometrically transformed (e.g., picture-in-picture or recaptured videos) or overlaid with logos or subtitles. In this sense, methods for NDI need to be more robust but less discriminative than those for FDD.

3. Proposed Method

3.1. The Pipeline

We detect frame duplication in a coarse-to-fine manner. The pipeline consists of two key steps, coarse matching and duplication verification, as shown in Figure 2. The main purpose of the coarse matching stage is to significantly reduce the computational burden of the second stage. Given an input video, following [8], we first divide it into $N-L+1$ overlapping subsequences, where $N$ is the total number of frames and $L$ is the subsequence length. For each subsequence $S_i$, using LSH, we identify from $S_i$'s succeeding subsequences those that are visually similar to $S_i$. In this way, we cluster $S_i$ and its duplication candidates into the same group. Then, in the duplication verification stage, we perform image registration between each pair of corresponding frames in $S_i$ and its duplication candidates, respectively. Through image registration, we obtain a series of offset fields, and the zero-offset fields verify duplications.
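For concreteness, the overall coarse-to-fine pipeline can be sketched as below. This is a minimal illustration under our own assumptions, not the original (MATLAB) implementation; `lsh_candidates` and `verify_by_registration` are placeholder callables standing in for the two stages described in the following subsections.

```python
# Minimal sketch of the coarse-to-fine pipeline.
# Assumptions: frames is a list of grayscale images; L is the subsequence length;
# lsh_candidates(i, subsequences) returns indices j > i of visually similar subsequences;
# verify_by_registration(frame_a, frame_b) returns True when the offset field is zero.

def detect_duplications(frames, L, lsh_candidates, verify_by_registration):
    """Return a list of (i, j) index pairs of verified duplicated subsequences."""
    n_sub = len(frames) - L + 1
    subsequences = [frames[i:i + L] for i in range(n_sub)]

    detections = []
    for i, s_i in enumerate(subsequences):
        # Coarse matching: LSH proposes candidate subsequences that look similar to s_i.
        for j in lsh_candidates(i, subsequences):
            # Fine verification: register corresponding frames; duplication is
            # confirmed only when every frame pair yields a zero offset field.
            if all(verify_by_registration(s_i[k], subsequences[j][k]) for k in range(L)):
                detections.append((i, j))
    return detections
```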

3.2. Coarse Matching by LSH

The left part of Figure 2 depicts the process of coarse matching for a given subsequence $S_i$. For $S_i$, we would like to find, among its succeeding subsequences, the set of duplication candidates that are perceptually similar to $S_i$, such that the duplication verification procedure, which is more accurate but slower, needs to compare $S_i$ only with these candidates instead of with all of its succeeding subsequences. To this end, what we need is a feature that is sensitive to content change (i.e., changes in objects, objects' shape, and objects' position) while robust against image degradation. Although many image hashing schemes can be used for this purpose (e.g., the wavelet-based [15] and SVD-based [16] schemes), we find that a block-wise GIST [17] feature best meets our requirement in terms of the tradeoff between robustness, discriminative power, and computational time. For each frame, we extract the block-wise GIST descriptor; the descriptors extracted from the frames of a subsequence are then concatenated to form a one-dimensional feature for that subsequence.
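As an illustration, a subsequence feature might be assembled as in the sketch below. Here `gist_descriptor` stands in for any block-wise GIST implementation, and the block grid is an illustrative assumption rather than the configuration used in the paper.

```python
import numpy as np

def subsequence_feature(frames, gist_descriptor, blocks=(4, 4)):
    """Concatenate block-wise GIST descriptors of all frames into one 1-D feature.

    frames: list of 2-D numpy arrays (grayscale frames of one subsequence).
    gist_descriptor: callable mapping an image block to a 1-D numpy array (assumed).
    blocks: number of blocks along (rows, cols); a hypothetical default.
    """
    per_block = []
    for f in frames:
        h, w = f.shape
        bh, bw = h // blocks[0], w // blocks[1]
        for r in range(blocks[0]):
            for c in range(blocks[1]):
                block = f[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
                per_block.append(gist_descriptor(block))
    return np.concatenate(per_block)
```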

We exploit LSH to determine whether a feature $\mathbf{f}_j$ is sufficiently close to a feature $\mathbf{f}_i$. Given an error probability $\delta$ and a distance threshold $r$, when $\|\mathbf{f}_i - \mathbf{f}_j\|_2 \le r$, LSH guarantees that the hash values of $\mathbf{f}_i$ and $\mathbf{f}_j$ collide with probability at least $1 - \delta$. In this paper, we use the $p$-stable distribution-based LSH [18]:

$$h_{\mathbf{a},b}(\mathbf{v}) = \left\lfloor \frac{\mathbf{a}\cdot\mathbf{v} + b}{w} \right\rfloor, \qquad (1)$$

where $\mathbf{v}$ is the feature vector being hashed, $\mathbf{a}$ is a real-valued vector whose elements are independently drawn from the standard normal distribution $\mathcal{N}(0,1)$, which has been proven to be $2$-stable [18] (since we use the $\ell_2$-norm to measure the difference between features), and $b$ is a real scalar uniformly drawn from $[0, w]$, where $w$ is a real scalar (the bucket width).
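A minimal sketch of this hash family follows; the variable names mirror (1), and the random number handling is an implementation assumption.

```python
import numpy as np

class PStableHash:
    """One p-stable (Gaussian, i.e., 2-stable) LSH function h(v) = floor((a.v + b) / w)."""

    def __init__(self, dim, w, rng=None):
        rng = rng or np.random.default_rng()
        self.a = rng.standard_normal(dim)   # entries drawn from N(0, 1)
        self.b = rng.uniform(0.0, w)        # offset drawn uniformly from [0, w]
        self.w = w                          # bucket width

    def __call__(self, v):
        return int(np.floor((np.dot(self.a, v) + self.b) / self.w))
```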

To produce more reliable results, we construct multiple hash tables, and subsequence $S_j$ is considered a duplication candidate of $S_i$ only when the hash values of $S_i$ and $S_j$ collide more than a preset number of times (the specific numbers used in our experiments are given in the Experimental Results section).
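Candidate collection with multiple tables can then be sketched as follows. The table count and collision threshold are parameters (80 tables and a threshold of 40 are reported later in the experiments), and `PStableHash` is the illustrative class from the previous sketch; this is an assumption-laden outline, not the authors' code.

```python
import numpy as np
from collections import defaultdict

def collect_candidates(features, dim, w, n_tables=80, min_collisions=40, rng=None):
    """Return {i: set of j > i} whose hash values collide in more than min_collisions tables."""
    rng = rng or np.random.default_rng(0)
    hashes = [PStableHash(dim, w, rng) for _ in range(n_tables)]

    # Count, for every pair (i, j), in how many tables their hash values collide.
    collisions = defaultdict(int)
    for h in hashes:
        buckets = defaultdict(list)
        for i, f in enumerate(features):
            buckets[h(f)].append(i)
        for bucket in buckets.values():
            for a in range(len(bucket)):
                for b in range(a + 1, len(bucket)):
                    collisions[(bucket[a], bucket[b])] += 1

    candidates = defaultdict(set)
    for (i, j), c in collisions.items():
        if c > min_collisions:
            candidates[i].add(j)
    return candidates
```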

If the collected duplication candidates include a sufficiently long run of consecutive subsequences (corresponding to about 0.5 second in our experiments), then these consecutive subsequences are considered to belong to a static scene and are discarded.

Note that this coarse matching stage also involves a distance threshold, but it differs from the thresholds used in existing works in that it is not used for making the final decision. The coarse matching step only eliminates unnecessary computations; therefore, when choosing this threshold, we do not have to worry much about the tradeoff between robustness and distinctiveness; we just have to guarantee that the true duplications of a subsequence are contained in its candidate set. In fact, in practice, the threshold does not need to be explicitly assigned, and we discuss this in more detail in the Experimental Results section.

At the end of this stage, each subsequence $S_i$ is associated with a set of potential duplication candidates. The duplication verification step is then performed between $S_i$ and these candidates.

3.3. Duplication Verification

For each subsequence and each of its duplication candidates, we perform image registration between the corresponding frames to check whether the two frames contain the same objects and whether the shapes and positions of the corresponding objects happen to be identical. If so, the registration will yield zero-valued offset fields. However, it is not easy to stably obtain correct registration results for degraded images. As shown in Figure 1, even lossy compression by itself can result in substantial changes between a frame and its copy, which often cause registration faults. To solve this problem, we propose to find the stable regions in the frames and rely on these regions more than on the unstable areas during the registration procedure. We use a variant of the Harris cornerness response proposed in [19] to measure the stability of the local structure around a pixel:

$$R = \frac{g_{xx}\,g_{yy} - g_{xy}^{2}}{g_{xx} + g_{yy}}, \qquad (2)$$

where $g_{xx} = G \ast I_x^2$, $g_{yy} = G \ast I_y^2$, and $g_{xy} = G \ast (I_x I_y)$; $I$ denotes a given frame, $I_x$ and $I_y$ are its horizontal and vertical derivatives, $G$ is a two-dimensional Gaussian kernel, and $\ast$ denotes the convolution operation.
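A sketch of the stability map computed from (2) is given below; the derivative operator (Sobel), the Gaussian scale, and the small constant added to the denominator are implementation assumptions rather than the paper's exact choices.

```python
import numpy as np
from scipy.ndimage import sobel, gaussian_filter

def stability_map(frame, sigma=1.5, eps=1e-8):
    """Cornerness response R = (gxx * gyy - gxy^2) / (gxx + gyy), as in (2)."""
    img = frame.astype(np.float64)
    ix = sobel(img, axis=1)              # horizontal derivative
    iy = sobel(img, axis=0)              # vertical derivative
    gxx = gaussian_filter(ix * ix, sigma)
    gyy = gaussian_filter(iy * iy, sigma)
    gxy = gaussian_filter(ix * iy, sigma)
    return (gxx * gyy - gxy * gxy) / (gxx + gyy + eps)  # eps avoids division by zero
```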

For a frame, a large value of $R$ at a pixel indicates that both eigenvalues of the autocorrelation matrix corresponding to that pixel are large. This means that the signal changes significantly in two orthogonal directions; such points have been shown to be stable under various conditions except scale change [20, 21]. We use $R$ to weight the contributions of different regions of a frame during the registration process, and the registration objective is written as the sum of a data term (3), weighted point-wise by a matrix $W_D$, and a smoothness term (4), weighted edge-wise by a matrix $W_S$: the data term measures the difference between the local structures around the matching pixels, and the smoothness term guarantees that neighbouring pixels have similar offsets. In the objective, $\mathbf{w}(\mathbf{p})$ denotes the offset for point $\mathbf{p}$, and $(\mathbf{p},\mathbf{q})$ denotes an edge in the neighbourhood system over which the smoothness term is summed.

$W_D$ and $W_S$ are the weighting matrices defined in (5) and (6), respectively; both are computed from $\tilde{R}$, the normalized version of the cornerness response $R$. We apply maximal filtering to $\tilde{R}$ to diffuse the impact of the stable points to a small range around them.

The threshold in (5) is quite a small value used to mask out the excessively smooth regions when computing the data term (3). Although the Harris cornerness response assigns smooth regions low weights, areas that are excessively flat still cause trouble during registration: the local structure of such regions can easily be changed by small perturbations. Figure 1(f) is a clear illustration of this phenomenon, where the large bright spots (significant differences between visual word indices) are all located on the wall or the floor, which are both rather smooth. Such excessively smooth regions result in a high data cost that is inconsistent with the real situation. Based on this observation, we set the threshold to remove the impact of the data costs in those areas. As a consequence, the offset field within those regions is completely controlled by the smoothness term (4); we therefore add a truncation value in (6) to guarantee that the smoothness constraint always remains above a certain level.

The data term (3) and the smoothness term (4) are defined in (7) and (8), respectively, in terms of the per-pixel features of the source and target frames (we use the SIFT descriptor [22] extracted on a single scale as the feature for each pixel) and the horizontal and vertical components of the offsets. We use a truncated $\ell_1$-norm in (8) to account for discontinuities in the offset field, and a weighting coefficient is used to balance the data term (3) against the smoothness term (4) (this coefficient could also be folded into (5); we keep it separate for clarity).
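To make the structure of the objective concrete, the sketch below evaluates the energy of a candidate offset field. It is an illustration under stated assumptions only: dense per-pixel SIFT features are assumed to be precomputed, the edge weights are taken from the first endpoint of each edge, and the truncation and balancing constants are placeholders, not the calibrated values used in the paper.

```python
import numpy as np

def registration_energy(feat_src, feat_tgt, u, v, w_data, w_smooth,
                        smooth_trunc, lam):
    """Weighted data + smoothness energy of an offset field (u, v).

    feat_src, feat_tgt: H x W x D per-pixel descriptors (e.g., dense single-scale SIFT).
    u, v: integer offset fields (horizontal / vertical components), each H x W.
    w_data: per-pixel data-term weights, assumed to already combine the normalized
            cornerness response and the mask for excessively smooth regions.
    w_smooth: per-pixel smoothness-term weights (a simplification of edge weights).
    smooth_trunc, lam: truncation and balancing constants (illustrative values only).
    """
    H, W, _ = feat_src.shape
    energy = 0.0
    for y in range(H):
        for x in range(W):
            ty, tx = y + v[y, x], x + u[y, x]
            if 0 <= ty < H and 0 <= tx < W:
                # Data term: weighted L1 difference between the matched descriptors.
                energy += w_data[y, x] * np.abs(feat_src[y, x] - feat_tgt[ty, tx]).sum()
            # Smoothness term over the 4-neighbourhood (right and bottom neighbours),
            # decoupled into horizontal and vertical offset components and truncated
            # to tolerate discontinuities in the offset field.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < H and nx < W:
                    s = min(abs(int(u[y, x]) - int(u[ny, nx])), smooth_trunc) + \
                        min(abs(int(v[y, x]) - int(v[ny, nx])), smooth_trunc)
                    energy += lam * w_smooth[y, x] * s
    return energy
```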

We use the dual-layer loopy belief propagation of [23] to minimize the objective function. By decoupling the smoothness term (4) into the two parts in (8) corresponding to the two offset directions, the complexity of the message update in each iteration is reduced from $O(n_p\,n_o^4)$ to $O(n_p\,n_o^2)$, where $n_p$ is the number of pixels in each frame and $n_o$ is the number of possible offsets in each direction. The complexity is further reduced to $O(n_p\,n_o)$ by the distance transform proposed in [24]. The multigrid message passing scheme in [21] is also exploited to significantly reduce the total number of iterations.

Optical flow (e.g., [25]) or SIFT flow [23] could also be used for image registration, which is intrinsically a pixel-wise correspondence estimation process. However, neither of them obtains the expected results for degraded videos. The difference between our objective and those of optical flow and SIFT flow is obvious: we encode the stability information of different regions into the matching objective, which makes our method quite robust against video degradation. Moreover, our objective contains no small-displacement term, which is used in SIFT flow, so the registration can be more sensitive to subtle changes between two frames. Figure 3 shows two representative examples demonstrating the difference between the three methods. The offset fields are visualized with the color encoding scheme of [26]; see Figure 4 for more details.

Compared with the typical methods for FDD [4–8], our method relies on image registration rather than on a feature extraction and thresholding strategy. Conforming to the data similarity and smoothness constraints, the correspondences between the pixels of two frames are established in an “optimal” manner through a probabilistic inference process (i.e., the dual-layer loopy belief propagation). Furthermore, the pixels with high Harris cornerness response are usually located on the boundaries of objects; therefore, when the Harris cornerness response is integrated into the registration objective, the registration can, to some extent, be considered an object-level matching process. As a result, even though the registration objective involves more parameters than typical feature extraction and thresholding-based methods do, we will show in the experimental part that, once the parameters are calibrated, the proposed method performs more robustly than the feature extraction and thresholding-based methods.

4. Experimental Results

4.1. The Dataset

As far as we know, there is no publicly available dataset dedicated to FDD evaluation. Therefore, we created a dataset to evaluate the performance of the proposed method, especially for degraded cases. We captured five indoor and eight outdoor video clips (named “v01” to “v13,” with “v01”–“v05” being indoor scenes) with a Panasonic HDC-Z10000GK camcorder. The videos were shot in the Science Park of Harbin Institute of Technology. The clips were captured from different scenes, and their contents include characters, landscapes, buildings, and plants. Several screenshots from our dataset are shown in Figure 5. The video clips are H.264 encoded by the built-in codec, and we then converted these clips into the .mp4 format with Adobe Premiere Pro CS 5.5. The clips share a common resolution, and the frame rate is 25 FPS. Based on these original clips, we created three forgery subsets: the MCOMP subset, the MCOMP + AGN subset, and the MCOMP + INT subset. The details of these subsets are listed in Table 1. Each original clip corresponds to 9 forged versions, and the whole dataset consists of 117 forged videos in all.

The magnitudes of the additive Gaussian noise and of the intensity change are moderate, so that they are hardly perceptible. The duration of the forged video clips varies from 8 to 30 seconds.

4.2. The Efficiency of LSH-Based Coarse Matching

As mentioned above, given a subsequence $S_i$, in the coarse matching stage we exploit LSH to find its set of duplication candidates. In theory, to use the $p$-stable distribution-based LSH, we have to assign the distance threshold $r$ and the error probability $\delta$ to determine the bucket width $w$ in (1). However, since we only use LSH as a coarse matcher and the result of the coarse matching does not have to be very accurate, we just have to make sure that the true duplications of $S_i$ are contained in its candidate set. In this sense, we can determine $w$ by training rather than by first assigning $r$ and $\delta$ and then calculating $w$ from them.

To make sure that no duplication is missed, we define the completeness of the correctly collected duplication candidates as $C = n_c / n_d$, and for the training set we should choose $w$ such that

$$C = 1, \qquad (9)$$

where $n_c$ is the number of correctly collected duplication candidates and $n_d$ is the actual number of duplicated subsequences. Given the premise in (9), we use the average number of collisions per subsequence, $\bar{n} = n_{col} / (N - L + 1)$, to measure the efficiency of the coarse matching step, where $n_{col}$ is the total number of collisions. It is straightforward that $\bar{n}$ monotonically increases with $w$, and we prefer a smaller $\bar{n}$ as long as (9) is guaranteed.
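A small sketch of these two quantities, under the assumption that ground-truth duplication pairs are available for the training clips (the data structures are illustrative):

```python
def coarse_matching_stats(candidates, ground_truth, n_subsequences):
    """Completeness C and average collisions per subsequence, in the spirit of (9).

    candidates: dict mapping subsequence index i to the set of candidate indices j.
    ground_truth: dict mapping i to the set of truly duplicated indices j.
    n_subsequences: total number of subsequences (N - L + 1).
    """
    n_correct = sum(len(candidates.get(i, set()) & dup) for i, dup in ground_truth.items())
    n_dup = sum(len(dup) for dup in ground_truth.values())
    n_col = sum(len(c) for c in candidates.values())

    completeness = n_correct / n_dup if n_dup else 1.0
    avg_collisions = n_col / n_subsequences
    return completeness, avg_collisions
```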

We randomly selected four video clips among the most seriously degraded ones (i.e., those attacked by additive Gaussian noise with a standard deviation of 10 or by the strongest intensity downscaling) from the MCOMP + AGN and MCOMP + INT subsets, respectively, to train the parameter $w$. The block size of the block-wise GIST feature and the downscaling factor applied to each frame before feature extraction are fixed for all experiments, and the subsequence length $L$ is set to 5. We constructed 80 hash tables for the coarse matching stage; therefore, a pair of subsequences is considered a candidate pair only when their hash values collide more than 40 times. Under this configuration, $C$ and $\bar{n}$ vary with $w$ as shown in Figure 6. We set $w$ to the smallest value at which $C$ reaches 1 on the training set.

It is obvious that the values of $C$ and $\bar{n}$ depend on the content of the videos. We list in Table 2 the mean values of $C$ and $\bar{n}$ over the forged versions corresponding to each original clip. Most of the mean values of $C$ are above 0.98. The two smaller values (0.94 for v03 and v13) are both caused by intensity degradation. For video clip v03 in group INT95, the value of $C$ was only 0.61. This is because the intensities of the pixels of the duplicated frames are large; therefore, even a slight scaling produces a remarkable intensity change. On average, only a small fraction of the duplicated subsequences is missed during the coarse matching stage.

On the other hand, the average of $\bar{n}$ over the different scenes is 0.10; this implies that, in our duplication verification stage, we need to perform only 0.1 comparison per subsequence on average. In contrast, without the coarse matching stage, each subsequence would have to be compared with roughly $N/2$ other subsequences on average, where $N$ is the number of frames; $N$ is typically larger than 200 in our dataset and can be much larger in practice. In this sense, for our dataset, the computational load of the duplication verification stage is reduced by three orders of magnitude. We will show later that this coarse matching step is well worth performing, in spite of its own time cost.

4.3. The Detection Capability

In this section, we investigate the detection capability of our method. In our implementation, we randomly selected 7 forged clips to calibrate the three parameters of the registration objective, which are empirically set to 1.8, 635, and 12,800, respectively. The frames are downscaled (for acceleration) before registration. We binarize the resulting offset-field images and, to account for outliers, mask out nonzero regions whose area is below a small fraction of the whole field. We use precision (10), recall (11), and the F-score (12) to evaluate the performance:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad (10)$$
$$\text{recall} = \frac{TP}{TP + FN}, \qquad (11)$$
$$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}, \qquad (12)$$

where TP, FP, and FN are the number of correctly detected duplicated frame pairs, the number of falsely detected duplicated frame pairs, and the number of undetected duplicated pairs, respectively.
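For completeness, the three measures in (10)–(12) can be computed as follows (a trivial helper, with guards against empty denominators added as an assumption):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-score as defined in (10)-(12)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```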

We compared our method with the methods of [8] and [4] (denoted Farid and Li, respectively), and the related parameters are set to be identical to those in the respective papers. The comparison results for the MCOMP subset are shown in Tables 3–5. The label “v01” in the first row denotes the forged version of video clip v01 in the current forgery group, “average” means the average value over v01 to v13, and the same holds hereinafter. For v07 to v12 in the MCOMP100 group, all three methods performed perfectly. For v05 and v13, our method obtained rather low precision; this is because some shots in v05 and v13 are almost still, and the duplication verification step failed to differentiate the excessively similar frames. In contrast, [4] is quite effective in terms of discriminative power. Our method outperformed the other two in the first four cases. It should be noted that [8] detected none of the duplicated frames in v02 to v05 and v13, for differing reasons. In [8], when the temporal correlations of the frames in a subsequence are all above a certain value, this subsequence is considered static and then discarded; therefore, in v05 and v13, the subsequences whose frames are quite similar to each other were not compared with other subsequences at all. In contrast, the duplicated frames in v02 to v04 were missed due to an inappropriate correlation threshold.

For the MCOMP80 group, the precision of [4] for v05 and v13 dropped to 0.44 and 0.50, respectively, and the performance of [4] for the rest of the clips changed only slightly. The performance of [8] was rather poor for this group: it failed to detect 8 out of the 13 forged video clips. In this group, the precision for v08 and the recall for v07 of our method decreased to 0.50 and 0.83, respectively, and the results for the other clips were almost the same as those in the MCOMP100 group.

When the quality factor of the last compression dropped to 60, the performance of [4] decreased significantly: 6 video clips were mistakenly judged as unchanged. In contrast, most results of our method remained relatively stable.

The detection results for the MCOMP + AGN subset are shown in Tables 6–8. In the AGN1 group (Table 6), our method outperformed the other two for the first four cases. Our recall rate for v08 is a little lower than that of the other two. Besides, for v05 and v13, [4] still obtained better results than our method, although its precision for v05 dramatically dropped to less than 0.50. [8] obtained the worst results for v01 to v03, v05, v06, and v13. In Table 7, it is interesting that the precision of [4] for v05 in the AGN5 group is 1.00, whereas it was expected to be less than 0.50 due to the stronger noise. According to our observation, the difference between the source and target frames of v05 in this group is indeed small, which may be caused by the encoding mechanism of the lossy compression. In Table 8, for most cases, the precision and recall of [4, 8] dropped to 0. In contrast, our method is rather stable across the different groups, and its performance hardly changed.

The detection results for the MCOMP + INT subset are shown in Tables 9–11. Except for v05 and v13, our method is on par with or exceeds the other two methods in most cases. Since [8] uses the correlation of pixel intensities as its feature, which is rather robust against intensity change, it performed better on this subset than it did on the MCOMP + AGN subset. On the contrary, [4] is rather sensitive to intensity change: for the INT95 group, [4] detected no duplicated frames at all for 10 out of the 13 forged clips.

When comparing frames in degraded videos, the distances between the features can easily fall outside the range of a fixed threshold. As can be seen from Tables 3–11, when the degradation gets stronger, more than half of the forged video clips are falsely judged as innocent by [4, 8]: none of their duplicated frames is detected. By contrast, our method performed stably across the different test groups. Although our method occasionally performed worse than the other two methods (especially for v05 and v13), the averages of its precision, recall, and F-score, without exception, significantly outperformed those of [4, 8]. Even for the strongest degradation groups, i.e., MCOMP60, MCOMP + AGN10, and MCOMP + INT95, the average precision, recall, and F-score of our method are above 0.8 (even in the worst case, INT95).

4.4. The Running Time

The running time of the three methods is closely related to the content of the videos. When comparing subsequences, once any corresponding pair of frames is found not to be identical, the comparison between the current pair of subsequences terminates. Therefore, video clips whose frames are highly similar to each other result in more processing time. The comparison of the running times of the three methods is given in Table 12. The frames are scaled by the same factor for all three methods, and all the experiments were conducted on a workstation with an Intel Core i7-2600 processor and 24 GB RAM. We implemented the three methods in MATLAB R2014a.

As mentioned earlier, in [8], if the correlation coefficients between the frames within a given subsequence are all above a predefined value, the subsequence is considered static and discarded. Such subsequences are not compared with other subsequences, which is why the running time of [8] is so short for scenes v05 and v13 (recall that [8] detected none of the forgeries corresponding to these two scenes).

In fact, both the structural similarity and the correlation coefficients used in [4, 8], respectively, can be computed much faster than the image registration process in our method. For each pair of frames to be compared, the structural similarity and the correlation coefficients can be calculated in about 0.03 and 0.05 seconds, respectively. In contrast, in our method, the image registration procedure takes about 4 seconds for each pair of frames to be compared. Even so, our method is faster than the other two in most cases. The coarse matching step plays an important role in this acceleration: as demonstrated in Table 2, the total number of subsequences that need the finer duplication verification is reduced by several orders of magnitude, especially for video clips whose content changes rapidly across frames. The average running time of each step of our method is listed in Table 13.

The coarse matching step accounts for only a small fraction of the total running time and takes less than one second per frame on average. Such a step is well worth performing: without coarse matching, even a 10-second video clip would cost several hours to check for forgery.

5. Conclusion and Future Work

In this paper, we proposed a new method for frame duplication detection, designed particularly for degraded videos. Our method detects duplication forgeries in a coarse-to-fine manner and consists of two steps: coarse matching and duplication verification. In the coarse matching stage, we use locality-sensitive hashing to precluster visually similar subsequences. Through coarse matching, the total number of subsequences that need the finer duplication verification is reduced by several orders of magnitude. The duplication verification step exploits image registration to identify identical subsequences. We encode the stability information of different regions into the registration objective function so that the registration works stably for degraded videos. Unlike existing methods, our detection process does not rely on a fixed distance threshold, which is typically unreliable for degraded videos. Experimental results show that our method outperformed state-of-the-art methods in most cases and exhibited outstanding robustness under different conditions. However, our method cannot distinguish between highly similar frames; as a result, for video clips whose content changes only slightly across frames, the precision can be rather low. Further efforts should be made to improve the discriminative power of the registration process.

Data Availability

The dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 61771168) and the Natural Science Foundation of Heilongjiang Province of China (Grant no. F2017014).