Abstract

The technology of frame interpolation can be applied in intelligent monitoring systems to improve the quality of surveillance video. In this paper, a region-guided frame interpolation algorithm is proposed that introduces two innovative improvements. On the one hand, a detection approach based on visual correspondence is presented for detecting the motion regions that correspond to objects of interest in video sequences, which narrows the prediction range of interpolated frames. On the other hand, spatial and temporal mapping rules are proposed using coherency sensitive hashing, which yield more accurate predicted values of interpolated pixels. Experiments show that the proposed method achieves encouraging performance in terms of both visual quality and quantitative measures.

1. Introduction

Frame interpolation plays a very important role in intelligent monitoring systems. This technology can not only increase the frame rate of surveillance video to meet the requirements of monitoring display devices [1] but also predict missing frames to obtain smooth surveillance video. The essence of frame interpolation is to predict intermediate frames from two given frames in a video sequence [2], and existing methods can be roughly divided into two categories: optical flow methods and block matching methods. The former use the time-domain variation and correlation of pixel intensity data to determine the location and value of each pixel [3]. However, in practice, because of factors such as varying light levels, transparency, and noise, the assumptions of brightness constancy and spatial smoothness in the basic optical flow equations cannot always be satisfied [4, 5]. The latter find the best match for each block by minimizing the matching difference [6, 7]. This approach is simple and easy to implement, but blocking artifacts occur in the interpolated frames [8]. In brief, these existing methods still suffer from poor frame quality. Therefore, it would be beneficial to develop an algorithm that obtains high-quality interpolated frames.

In this paper, a novel region-guided frame interpolation algorithm (RGFI) is proposed, the goal of which is to improve the definition of the intermediate frames of surveillance video. To achieve this, two techniques are introduced for frame interpolation. The first is visual correspondence using local descriptor matching [9, 10]. For this purpose, compact and real-time descriptors (CARD) [11] are used, an approach proposed recently to establish visual correspondence quickly between two images; each CARD descriptor is computed approximately 16 times faster than a scale-invariant feature transform (SIFT) descriptor [12]. The second is a recent approximate nearest neighbor (ANN) method [13] called coherency sensitive hashing (CSH) [14], which was proposed to find matching patches quickly. CSH relies on hashing to propagate information through similarity in appearance space and neighborhood in the image plane. Its advantage is the use of these observations to establish candidate block sets, which avoids many artifacts along edges when reconstructing an original image. Through the techniques described above, the interpolation task for video sequences is approached satisfactorily from a different perspective.

The main contributions of this paper can be summarized as follows: (1) a novel and comprehensive frame interpolation framework based on the spatial and temporal correlations in video sequences; (2) a detection approach based on visual correspondence to capture motion regions, which narrows the prediction range of interpolation frames and whose foundation is the frame difference technique and the CARD technique; (3) the definitions of the spatial and temporal block and of the spatial and temporal mapping relationship for better implementation of the RGFI algorithm, which are combined with CSH to construct spatial and temporal mapping rules that predict the values of interpolated pixels accurately; (4) application of the proposed algorithm to video resizing, where the accurate motion regions obtained by the proposed algorithm allow the important content of the surveillance video to be preserved while ensuring the global visual effect, so that it can be displayed on monitoring display devices with different resolutions.

The rest of the paper is structured as follows. Section 2 provides an overview of the proposed RGFI algorithm. Section 3 demonstrates the implementation details of RGFI. Section 4 presents experimental work carried out to demonstrate the effectiveness of the algorithm. Section 5 shows another contribution of our algorithm. Section 6 concludes the paper.

2. RGFI Overview

In this paper, a novel RGFI algorithm is proposed according to characteristics of video sequences. The RGFI framework is shown in Figure 1 and is divided into motion region detection and interpolated pixel computation. First, the detection approach using visual correspondence to capture motion regions is presented. Then the spatial and temporal mapping scheme based on CSH is demonstrated to compute the unknown pixel values in the motion regions obtained. At the same time, for the pixels of other regions of input video frames, only the original values are kept. Through these two steps, high-quality interpolation frames are produced.

3. Implementation of RGFI

3.1. Motion Region Detection Using Visual Correspondence

The motion regions of the surveillance video are very important in the intelligent monitoring system. Therefore, by finding these regions, the interpolation task can be completed more efficiently. Unlike previous methods, the method proposed here takes advantage of visual correspondences based on CARD to obtain the motion regions between video frames. The detailed implementation of the detection approach includes three parts: initial motion region estimation, key point correspondence establishment, and motion region determination.

Because the frame difference method [15] can quickly find the outline of the moving target in video sequences, it is first used to estimate rough motion regions, as shown in (1):
$$R_k(x,y)=\begin{cases}1,&\left|f_{k+1}(x,y)-f_k(x,y)\right|>T\cdot D,\\0,&\text{otherwise},\end{cases}$$
where $f_k$ and $f_{k+1}$ denote the two consecutive frames of the input video sequences, $R_k$ is the initial motion region corresponding to $f_k$, $T$ is a predefined value, and $D$ is the difference between the maximum value and the minimum value in the two consecutive frames, as shown in (2):
$$D=\max_{(x,y)}\max\big(f_k(x,y),f_{k+1}(x,y)\big)-\min_{(x,y)}\min\big(f_k(x,y),f_{k+1}(x,y)\big).$$
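To make the estimation concrete, the following NumPy sketch implements the thresholded frame difference of (1)-(2); the function name and the interpretation of $D$ as the intensity spread over the two frames are assumptions based on the definitions above.

```python
import numpy as np

def initial_motion_region(f_k, f_k1, threshold=0.2):
    """Rough motion mask via thresholded frame differencing, cf. (1)-(2).

    f_k, f_k1 : consecutive grayscale frames as float arrays.
    threshold : predefined value T (0.2 in the paper's experiments).
    """
    diff = np.abs(f_k1 - f_k)
    # D: spread between the largest and smallest intensities over both frames
    d = max(f_k.max(), f_k1.max()) - min(f_k.min(), f_k1.min())
    return diff > threshold * d   # boolean mask of candidate motion pixels
```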

Second, inspired by [11], comparatively accurate key point correspondences are established in the initial motion regions; the establishment process includes the following four steps.

Step 1. Construct an image pyramid [16] for the initial regions, as shown in (3):
$$s_l(x,y)=s_0\!\left(\sigma^l x,\;\sigma^l y\right),$$
where $s_l$ is the scale function at level $l$, $x$ and $y$ are the values of the abscissa and the ordinate, $l$ is the level of the image pyramid, and $\sigma$ is the downsampling factor.

Step 2. A corner detector technique [17] is used to find the key points in the image pyramid, and the corresponding point set is denoted as $P=\{p_1,\dots,p_n\}$. Similarly, the key point set $P'$ corresponding to the initial region of $f_{k+1}$ is obtained.

Step 3. For each key point $p_i\in P$, the orientation histogram is determined; the corresponding gradient magnitudes and orientations are given in (4):
$$m(x,y)=\sqrt{\big(s_l(x+1,y)-s_l(x-1,y)\big)^2+\big(s_l(x,y+1)-s_l(x,y-1)\big)^2},\qquad \phi(x,y)=\tan^{-1}\frac{s_l(x,y+1)-s_l(x,y-1)}{s_l(x+1,y)-s_l(x-1,y)}.$$
At the same time, log-polar binning is used to achieve good discrimination ability [18], as shown in (5):
$$B_\theta(u,v)=\Big(Q\big(\log\sqrt{u^2+v^2}\,\big),\;Q\big(\operatorname{atan2}(v,u)-\theta\big)\Big),$$
where $B_\theta$ denotes a rotated binning pattern, $\theta$ is a predefined value, $u$ and $v$ denote the coordinate values of a pixel in relative coordinates with respect to $p_i$, and $Q$ represents the quantization function, given in (6), whose output is the quantization result:
$$Q(z)=\left\lfloor\frac{z-z_{\min}}{z_{\max}-z_{\min}}\,N\right\rfloor,$$
where $N$, $z_{\min}$, and $z_{\max}$ are constants. On this basis, a spatial binning table [19] is used to extract the histogram entries and to obtain the descriptor $d_i$ of $p_i$, forming the descriptor set $D=\{d_1,\dots,d_n\}$.

Step 4. For each $d_i\in D$, compute its corresponding descriptor in the next frame according to (7):
$$d_i'=\arg\min_{d_j'\in D'}\big\|\,b_i-b_j'\,\big\|_H,\qquad b_j=\operatorname{sgn}(W d_j)\in\{0,1\}^L,$$
where $b_j$ is the short binary code of any descriptor $d_j$, $L$ is the length of the binary code, $\|\cdot\|_H$ denotes the Hamming distance, and $W$ is a weight matrix. In this way, the corresponding set $D'$ of descriptor set $D$ and the matching key point $p_i'$ of each $p_i$ can be obtained.

Finally, more accurate boundary values for the motion regions can be determined according to the locations of key points and their correspondences between consecutive frames. Let $\{(x_i,y_i)\}_{i=1}^n$ denote the locations of the matched key points; the boundary values of motion region $R_k$ can then be computed using (8):
$$x_L=\min_i x_i-\delta_L,\quad x_R=\max_i x_i+\delta_R,\quad y_T=\min_i y_i-\delta_T,\quad y_B=\max_i y_i+\delta_B,$$
where $x_L$, $x_R$, $y_T$, and $y_B$ are the left border, right border, top border, and bottom border and $\delta_L$, $\delta_R$, $\delta_T$, and $\delta_B$ are the predefined minimal deviation values.
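As an illustration of (8), the following sketch (with hypothetical names) tightens the motion region around the matched key points; the margin of 8 pixels follows the parameter choice reported in Section 4.

```python
import numpy as np

def motion_region_bounds(points, margin=8):
    """Motion region borders from matched key points, cf. (8).

    points : (n, 2) array of (x, y) key point locations in one frame.
    margin : predefined deviation value (8 in the paper's experiments).
    """
    pts = np.asarray(points)
    x_left   = pts[:, 0].min() - margin
    x_right  = pts[:, 0].max() + margin
    y_top    = pts[:, 1].min() - margin
    y_bottom = pts[:, 1].max() + margin
    return x_left, x_right, y_top, y_bottom
```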

Figure 2 shows one example of motion region detection. Figures 2(a) and 2(b) are the two consecutive frames of the walk video, and the key point correspondences between them are presented. From the results for the motion region shown in Figure 2(c), it is evident that the proposed detection approach can perform well in expressing the motion content of video sequences, which lays the foundation for interpolated pixel computation.

3.2. Interpolated Pixel Computation Using Coherency Sensitive Hashing

To facilitate the computation of interpolated pixels, it is necessary to define some concepts in advance.

Definition 1. For the motion regions obtained in two consecutive frames, the original motion region $R_o$ and the mapping motion region $R_m$ are defined. $R_o$ or $R_m$ is divided into overlapping image blocks, and each block is defined as a spatial and temporal block, ST for short. Denote the total numbers of STs as $N_o$ and $N_m$ and the corresponding sets as $S_o=\{ST_1^o,\dots,ST_{N_o}^o\}$ and $S_m=\{ST_1^m,\dots,ST_{N_m}^m\}$, respectively.

Definition 2. Given $ST_i^o\in S_o$, assume that $ST_j^m\in S_m$ is the corresponding mapping block of $ST_i^o$ and that it is computed through the spatial and temporal mapping rules. Then $ST_i^o\rightarrow ST_j^m$ is defined as the spatial and temporal mapping relationship between $ST_i^o$ and $ST_j^m$.
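The following sketch illustrates one possible way of dividing a motion region into overlapping ST blocks as in Definition 1; the block size and stride are illustrative assumptions (16 is the ST size reported in Section 4).

```python
import numpy as np

def extract_st_blocks(region, block=16, step=1):
    """Divide a motion region into overlapping ST blocks (Definition 1).

    region : 2-D array holding the pixels of R_o or R_m.
    block  : ST side length (16 worked best in the paper's experiments).
    step   : stride between neighboring blocks (1 gives fully overlapping STs).
    """
    h, w = region.shape
    blocks, coords = [], []
    for y in range(0, h - block + 1, step):
        for x in range(0, w - block + 1, step):
            blocks.append(region[y:y + block, x:x + block])
            coords.append((x, y))          # upper-left corner of each ST
    return np.stack(blocks), coords
```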

To the best of our knowledge, the block matching method generally finds only one match for each divided original block. However, the computation method proposed here can obtain two or more matching blocks for each ST so as to approach the true values of the interpolated pixels more closely. This computation method includes four parts: ST projection and conversion, mapping block computation, nearest block determination, and unknown pixel computation.

First, the projection of each ST onto Walsh-Hadamard kernels [20] is computed. Specifically, for each ST $ST_i^o\in S_o$ (and likewise for $S_m$), gray-code filter kernels [21] are computed as the transform kernels, as shown in (9) and Figure 3:
$$t_{i,j}=\big\langle ST_i^o,\,WH_j\big\rangle,\quad j=1,\dots,m,$$
where $WH_j$ denotes the $j$th Walsh-Hadamard kernel. The results are stored in the temporary set $T=\{t_{i,j}\}$.
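A minimal sketch of the projection step follows; it computes a full 2-D Walsh-Hadamard transform per block and keeps the first coefficients, rather than the fast Gray-code filter scheme of [21], so it illustrates (9) but not its efficient implementation.

```python
import numpy as np
from scipy.linalg import hadamard

def wh_projections(st_blocks, n_kernels=8):
    """Project each ST onto the first Walsh-Hadamard kernels, cf. (9).

    st_blocks : (n, b, b) stack of ST blocks, with b a power of two.
    n_kernels : number of 2-D WH coefficients kept per block.
    """
    b = st_blocks.shape[1]
    H = hadamard(b)                       # b x b Hadamard matrix
    # Separable 2-D WH transform of every block: H @ block @ H^T
    coeffs = np.einsum('ij,njk,lk->nil', H, st_blocks, H)
    # Keep the leading coefficients (natural order) as the temporary set T
    return coeffs.reshape(len(st_blocks), -1)[:, :n_kernels]
```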

To accelerate ST mapping, a hash value is assigned to each transform result, and each hash function maps a $dim$-dimensional vector onto the set of integers [22]. In this way, the corresponding hash table is constructed, and each ST is saved in the entry $h_{a,b}(v)$, which is given by (10):
$$h_{a,b}(v)=\left\lfloor\frac{a\cdot v+b}{r}\right\rfloor.$$
In (10), $r$ is a fixed integer value which is set in advance, $a$ is a random projection vector, and $b$ is a random number drawn from a uniform distribution on the interval $[0,r]$. At the same time, for two vectors $u$ and $v$, let $d=\|u-v\|$; they are hashed to the same value with probability $p(d)$, given in (11):
$$p(d)=\Pr\big[h_{a,b}(u)=h_{a,b}(v)\big]=\int_0^r\frac{1}{d}\,f\!\left(\frac{t}{d}\right)\left(1-\frac{t}{r}\right)dt,$$
where $f$ denotes the probability density function of the projection distribution.
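The hash of (10) can be sketched as follows, assuming the standard p-stable LSH construction of [22]; the Gaussian choice for the projection vector $a$ is an assumption.

```python
import numpy as np

def lsh_hash(vectors, r=4, rng=np.random.default_rng(0)):
    """p-stable LSH in the style of (10): h(v) = floor((a . v + b) / r).

    vectors : (n, dim) array of WH projection vectors.
    r       : fixed bucket width chosen in advance.
    """
    n, dim = vectors.shape
    a = rng.standard_normal(dim)      # random projection direction
    b = rng.uniform(0.0, r)           # random offset, uniform on [0, r)
    return np.floor((vectors @ a + b) / r).astype(int)
```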

Then spatial and temporal mapping rules are established for each ST to expand the number of its mapping blocks. Here four mapping rules are established, shown as Rules 1-4 below; a sketch of the resulting candidate-set construction follows the rules. The first three mapping rules take advantage of the coherency sensitive hashing technique [14] to mine spatial correlations between frames, and Rule 4 makes direct use of the temporal correlations of video sequences.

Rule 1. If $H(ST_i^o)=H(ST_j^m)$, then $ST_j^m\in C(ST_i^o)$, where $ST_i^o\in S_o$, $ST_j^m\in S_m$, $H(\cdot)$ denotes the hash value in (10), and $C(\cdot)$ denotes the candidate mapping-block set.

Rule 2. If $ST_j^m\in C(ST_l^o)$, then the right neighboring block of $ST_j^m$ belongs to $C(ST_i^o)$, where $ST_i^o\in S_o$, $ST_j^m\in S_m$, and $ST_l^o$ is the left neighboring block of $ST_i^o$.

Rule 3. If $H(ST_i^o)=H(ST_k^o)$ and $ST_j^m\in C(ST_k^o)$, then $ST_j^m\in C(ST_i^o)$, where $ST_i^o,ST_k^o\in S_o$ and $ST_j^m\in S_m$.

Rule 4. If the coordinate values of the upper left corner of $ST_j^m$ are equal to those of $ST_i^o$, then $ST_j^m\in C(ST_i^o)$, where $ST_i^o\in S_o$ and $ST_j^m\in S_m$.
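A simplified sketch of the candidate-set construction follows; it covers Rules 1, 2, and 4 for a single hash table, omits Rule 3 for brevity, and uses illustrative function and variable names throughout.

```python
from collections import defaultdict

def build_candidates(hash_o, hash_m, coords_o, coords_m):
    """Candidate mapping blocks for each original ST via Rules 1, 2, and 4.

    hash_o, hash_m     : hash values of the original / mapping STs (one table).
    coords_o, coords_m : upper-left corners of the STs, assumed stride-1 and
                         in raster order (so a left neighbor is processed
                         before the current ST).
    """
    buckets = defaultdict(list)            # hash entry -> mapping STs (Rule 1)
    for j, h in enumerate(hash_m):
        buckets[h].append(j)
    index_o = {c: i for i, c in enumerate(coords_o)}
    index_m = {c: j for j, c in enumerate(coords_m)}

    cand = {i: set(buckets.get(h, ())) for i, h in enumerate(hash_o)}  # Rule 1
    for i, (x, y) in enumerate(coords_o):
        left = index_o.get((x - 1, y))
        if left is not None:               # Rule 2: shift the left neighbor's
            for j in cand[left]:           # candidates one step to the right
                r = index_m.get((coords_m[j][0] + 1, coords_m[j][1]))
                if r is not None:
                    cand[i].add(r)
        j = index_m.get((x, y))            # Rule 4: co-located mapping block
        if j is not None:
            cand[i].add(j)
    return cand
```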

The third step is to determine the nearest blocks from the obtained mapping blocks of each ST. Here a free search technique [8] based on the PatchMatch method is used to initialize the nearest blocks for each ST, as shown in (12):
$$u_n=v_0+w\alpha^n R_n.$$
In (12), $R_n$ is a uniform random variable in $[-1,1]\times[-1,1]$, $\alpha$ is a constant value, $w$ is a search radius, and $v_0$ is the current best block of any ST $ST_i^o$. Then, for each hash table, (13) is used to compare and update the nearest blocks:
$$NN(ST_i^o)=\arg\min_{ST_j^m\in C(ST_i^o)}\big\|ST_i^o-ST_j^m\big\|_2^2.$$
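The free search of (12) and the update of (13) can be sketched as follows under the stated assumptions ($\alpha=0.5$ as in Section 4, sum of squared differences as the block distance, hypothetical names).

```python
import numpy as np

def random_search(block, image, best_xy, w, alpha=0.5,
                  rng=np.random.default_rng(0)):
    """Free (random) search around the current best match, cf. (12)-(13).

    block   : the ST patch being matched, shape (b, b).
    image   : frame in which candidate blocks live.
    best_xy : current best upper-left corner (x0, y0).
    w       : initial search radius; alpha = 0.5 as in the paper.
    """
    b = block.shape[0]
    h, w_img = image.shape
    x0, y0 = best_xy
    best_cost = np.sum((image[y0:y0 + b, x0:x0 + b] - block) ** 2)
    n = 0
    while w * alpha ** n >= 1:                        # shrinking radius
        radius = w * alpha ** n
        dx, dy = rng.uniform(-1, 1, 2) * radius       # R_n in [-1,1] x [-1,1]
        x = int(np.clip(x0 + dx, 0, w_img - b))
        y = int(np.clip(y0 + dy, 0, h - b))
        cost = np.sum((image[y:y + b, x:x + b] - block) ** 2)
        if cost < best_cost:                          # update rule, cf. (13)
            best_cost, x0, y0 = cost, x, y
        n += 1
    return (x0, y0), best_cost
```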

Finally, the predicted value of the pixels in each ST can be computed through a smoothing operation over the nearest blocks obtained, as shown in (14):
$$\hat{p}(x,y)=\frac{1}{|N(ST_i^o)|}\sum_{ST_j^m\in N(ST_i^o)}ST_j^m(x,y),$$
where $\hat{p}(x,y)$ denotes the predicted value of an interpolated pixel, $N(ST_i^o)$ is the nearest-blocks set of $ST_i^o$, and $ST_j^m(x,y)$ is the value of the corresponding pixel in $ST_j^m$. In addition, to obtain color frame interpolation, the R, G, and B channel values of each pixel can be computed in the same way.
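A one-line sketch of the smoothing operation in (14):

```python
import numpy as np

def interpolate_pixels(nearest_blocks):
    """Predict ST pixel values by averaging the nearest blocks, cf. (14).

    nearest_blocks : (m, b, b) stack of the m nearest mapping blocks of one ST.
    Returns the (b, b) predicted patch; for color frames this is applied to
    the R, G, and B channels independently.
    """
    return np.mean(np.asarray(nearest_blocks, dtype=float), axis=0)
```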

4. Experimental Results and Discussion

In this section, the proposed RGFI algorithm is compared with other frame interpolation algorithms, including the block matching methods of three-step search (TSS) and adaptive rood pattern search (ARPS) [23], the optical flow methods of Horn and Schunck (H&S) [24] and Classic+NL-Full [5], and CSH [14]. Each algorithm was run in MATLAB on a PC with an Intel(R) Pentium(R) 4 CPU at 3.00 GHz and 3 GB of main memory. To evaluate algorithm performance, one in every two frames of the original video sequences was removed. Each removed frame was then reconstructed using the different frame interpolation algorithms, and the reconstructed frame was compared with the removed one. In all experiments, the detailed parameters were as follows. Since the threshold $T$ affects the accuracy of motion region detection, several values were tried, and it was finally set to 0.2. Following [11], the binning parameters were set to the values recommended there, which provide good discrimination ability. The adjustment parameters $\delta_L$, $\delta_R$, $\delta_T$, and $\delta_B$ are used to complete the conversion from key points to motion regions; experiments showed that the best choice for each parameter is 8. The value of $\alpha$ is 0.5, the same as in [8]. Based on [14], three different values were tried for the ST block size, namely, 8, 16, and 32; according to the obtained experimental results, 16 is the most appropriate value. The test video sequences used for these experiments were walk (640 × 480), jump in place (320 × 240), silent (176 × 144), and space (320 × 240). Walk and jump in place were provided by the image sequence evaluation research laboratory in Barcelona [25]. Silent was obtained from the video trace research group at Arizona State University [26]. Space was obtained from the Youku Web site. The selected test video sequences cover a variety of background and object motions that are frequently found in real video.

Figure 4 shows the frame interpolation results using the six algorithms. Every row in the figure shows the interpolated frames for the same video using different algorithms, and every column shows the interpolation results for different videos using the same algorithm. From the red-bordered region of each subfigure, the visual differences among the algorithms can be observed. TSS exhibited a poor interpolation effect, ARPS could not reveal the whole motion content, H&S and Classic+NL-Full introduced a suspension effect, and CSH produced disappointing artifacts, such as blurred legs on the walking man. From these figures, it is evident that the proposed algorithm performs comparatively better in terms of visual quality.

To validate the algorithm further, objective measurements are also provided. Peak signal-to-noise ratio (PSNR) and root mean square error (RMSE) are traditional quantitative measures of accuracy. Figures 5 and 6 show their values for the interpolated frames of different video sequences using the different algorithms. In these figures, the red curves represent the proposed algorithm, and the five black curves with different line styles represent the comparison algorithms. It can be observed that the PSNR values of the interpolated frames obtained using the proposed algorithm are generally the highest and the RMSE values the lowest. The proposed method can occasionally produce unsatisfactory results, for example, the PSNR value of interpolated frame 58 of walk. Overall, however, the proposed method clearly outperforms the other five algorithms.
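For reference, the two metrics can be computed as follows for 8-bit frames; these are the standard definitions, not code from the paper.

```python
import numpy as np

def rmse(ref, test):
    """Root mean square error between a removed frame and its reconstruction."""
    return np.sqrt(np.mean((ref.astype(float) - test.astype(float)) ** 2))

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB for 8-bit frames."""
    e = rmse(ref, test)
    return np.inf if e == 0 else 20.0 * np.log10(peak / e)
```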

The MSSIM measure was also used to evaluate the visual quality of the interpolated frames. MSSIM assesses image quality from an image formation point of view, under the assumption that human visual perception is correlated with image structural information [27, 28]. Figure 7 shows MSSIM comparison results using the six frame interpolation algorithms on the different video sequences. Note that the proposed RGFI algorithm generally achieved a higher MSSIM than the other five algorithms, which shows that the interpolated frames obtained using the proposed algorithm are closer to the original frames in terms of image structure similarity.

Table 1 summarizes the average PSNR, RMSE, and MSSIM values for each video sequence using the different algorithms. From this table, it is clear that RGFI always obtains the highest PSNR and MSSIM and the lowest RMSE. In short, because of its encouraging performance in terms of video visualization and quantitative quality assessment, the proposed algorithm is very competitive in frame interpolation.

5. Another Contribution

We combine RGFI with a seam carving approach [29] to achieve video resizing, so as to obtain high-quality resized results displayed on monitoring display devices with different resolutions. Figure 8 shows comparative results for frame 10 of the "walk with dog" video using four methods: the scaling method with a uniform resizing ratio, the best cropping method that directly cuts the frame, the seam carving method that removes or duplicates seams, and the proposed algorithm using the accurate motion regions obtained from RGFI. It can be seen that with the scaling method (see Figure 8(b)), the walking man and his dog both become blurrier than before. With best cropping (see Figure 8(c)), the walking man is only partly displayed, so original information is missing. With seam carving alone (see Figure 8(d)), the prominent part of this video sequence changes much less than with the previous two methods. From Figure 8(e), it is apparent that the proposed method can protect the prominent object of the original frame when the video sequence resolution is changed. Table 2 summarizes five evaluation indicators: average gradient (AG), information entropy (IE), edge intensity (EI), spatial frequency (SF), and image definition (ID). These indicators measure the resizing quality of the video frames. From this table, it is clear that the proposed method achieved the highest AG, IE, EI, SF, and ID values, which indicates that it can effectively improve resizing quality when image sequence resolutions are changed.
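For reference, two of these indicators, AG and IE, can be computed with their standard definitions as follows; this is a sketch, not the paper's code, and assumes 8-bit grayscale frames.

```python
import numpy as np

def average_gradient(img):
    """Average gradient (AG): mean magnitude of local intensity changes."""
    gy, gx = np.gradient(img.astype(float))
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

def information_entropy(img, bins=256):
    """Information entropy (IE) of the gray-level histogram, in bits."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))
```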

6. Conclusions

In this paper, a promising RGFI method for intelligent monitoring systems has been presented. The main feature of this method is its ability to obtain relatively high-quality interpolated frames according to the spatial and temporal correlations in video sequences. The implementation process involves two steps: motion region detection and interpolated pixel computation. The former determines the prediction range of interpolation frames through a detection approach based on visual correspondence, and the latter computes interpolated pixels using spatial and temporal mapping rules based on coherency sensitive hashing. Experimental results show that the proposed algorithm outperforms the other five representative frame interpolation algorithms examined, both in subjective quality and in quantitative measures. At the same time, RGFI combined with a seam carving approach can achieve video resizing. As a new frame interpolation algorithm, however, RGFI also has a disadvantage: it takes a comparatively long time because of its higher computational complexity. In the future, we will exploit a multicore architecture for parallel computing so as to reduce the running time of the algorithm.

Acknowledgments

This work was supported by the National Basic Research Program of China (973 Program) under Grant 2012CB821200 (2012CB821206), the National Natural Science Foundation of China (nos. 91024001 and 61070142), and the Beijing Natural Science Foundation (no. 4111002).