Abstract

A novel video shot boundary recognition method is proposed, consisting of two stages: video feature extraction and shot boundary recognition. First, we use adaptive locality preserving projections (ALPP) to extract video features. Unlike locality preserving projections, ALPP defines a discriminating similarity using mode prior probabilities and an adaptive neighborhood selection strategy, which makes it better suited to preserving the local structure and label information of the original data. Second, we use an optimized multiple kernel support vector machine to classify video frames into boundary and nonboundary frames, in which the weights of the different kernels are optimized with an ant colony optimization method. Experimental results show the effectiveness of our method.

1. Introduction

Video shot boundary recognition is a fundamental step towards video summarization and analysis. Many boundary recognition methods have already been presented [1, 2]. A common way to recognize a shot boundary is to compare the difference between two adjacent frames with a threshold. In paper [3], abrupt shot boundaries are detected with an adaptive threshold and gradual transition boundaries are detected with a set of standard templates. Warhade et al. [4] detected shot boundary with cross-correlation coefficient, stationary wavelet transform, and combination of local. Thakar and Hadia [5] proposed a new gradual shot detection method in which the threshold is adaptively determined from the total information change of video frames. Huo et al. [6] used a statistical model of the video frame differences to determine the adaptive threshold. In paper [7], the threshold is automatically determined according to the magnitude of the quantified color differences. Warhade et al. [8] first extracted structure features from each video frame by using the dual-tree complex wavelet transform and then decided the shot boundary based on the spatial domain similarity. In order to reduce the computation, Gao and Ma [9] used color histograms and mutual information to measure the difference between frames and then used the corner distribution of frames to exclude most of the false boundaries.

The main disadvantage of these methods is that they are sensitive to the choice of thresholds, which can lead to errors on complicated long gradual shots. To resolve this problem, video shot recognition can be treated as a classification task. In paper [10], a fuzzy logic method is used to detect shot boundaries. This method contains two processing modes, one dedicated to the detection of abrupt shots and the other to the detection of gradual shots. In paper [11], video features including HSV (hue, saturation, value), edge orientation, and texture are obtained, and then a Kohonen self-organized network is used to recognize shot boundaries. Huang et al. [12] classified video frames with a radial basis function neural network. Mohanta et al. [13] used a multilayer perceptron network to classify video frames based on a local feature matrix. To improve the recognition performance, Li et al. [14] first removed frames from the original video which were clearly not shot boundaries and then used a novel SIFT key point matching algorithm to detect shot boundaries. Zhao et al. [15] used a context feature vector and Tabu-SVM to recognize shot boundaries. In paper [16], the proposed approach first detected general shot boundaries with the Fisher criterion and then classified cut and gradual shots with SVM. In order to improve the effect of SVM, Zhao et al. [17] optimized the parameters of SVM with the particle swarm method. In addition, Lankinen and Kamarainen [18] detected shot boundaries using a visual bag-of-words approach. Donate and Liu [19] extracted salient features from a video sequence and tracked them over time to estimate shot boundaries. Li and Chen [20] recognized shot boundaries with macroblock type information, which saves considerable computation cost.

In this paper, we present a novel method to improve the shot boundary recognition accuracy. Firstly, based on the analysis of LPP, we present an adaptive LPP to extract more useful and discriminating features. Secondly, we recognize shot boundary with an optimized multiple kernel support vector machine.

The rest of this paper is organized as follows. Section 2 summarizes the theoretical fundamentals of LPP. In Section 3, we extract the video features with the improved LPP. Section 4 uses an optimized multiple kernel SVM to recognize shot boundaries. Experiments evaluating the presented method are reported in Section 5, and the paper is concluded in Section 6.

2. Theoretical Fundamentals

2.1. Locality Preserving Projection

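Locality preserving projection (LPP) is a dimensionality reduction method which can be explained by graph theory [21]. Assume there is an $m$-dimensional data point set $X = [x_1, x_2, \ldots, x_n]$, $x_i \in \mathbb{R}^m$; we try to find a projection matrix $A$ to project these data points into a low-dimensional subspace, and the projection can be expressed as $y_i = A^T x_i$.

The objective function of LPP is as follows:
$$\min \sum_{i,j} \|y_i - y_j\|^2 W_{ij}, \tag{1}$$
where the weight matrix $W$ can be defined as follows:
$$W_{ij} = \begin{cases} \exp\left(-\|x_i - x_j\|^2 / t\right), & \text{if } x_i \text{ and } x_j \text{ are neighbors}, \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$
Then, the objective function of LPP can be converted into the following minimization problem:
$$\min_{A} \operatorname{tr}\left(A^T X L X^T A\right) \quad \text{s.t. } A^T X D X^T A = I, \tag{3}$$
where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$ and $L = D - W$ is a Laplacian matrix.

Lastly, the projection matrix $A$ can be obtained by solving a generalized eigenvalue problem as follows:
$$X L X^T a = \lambda X D X^T a. \tag{4}$$

Let the column vectors $a_1, a_2, \ldots, a_d$ be the solutions of (4), ordered according to their eigenvalues, $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_d$; we can define the following transformation form:
$$x_i \to y_i = A^T x_i, \quad A = [a_1, a_2, \ldots, a_d]. \tag{5}$$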

2.2. Weight Definition

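In LPP, the weight between two points is defined either as a simple 0/1 indicator or by a heat kernel, neither of which reflects the class information. Given $x_i$, let $l_i$ and $N_k(x_i)$ be the label and the $k$ nearest neighbors of the point $x_i$. Li et al. [22] presented an orthogonal discriminating projection (ODP), in which the weight $W_{ij}$ between two points is defined as follows:
$$W_{ij} = \begin{cases} \exp\left(-\dfrac{d^2(x_i, x_j)}{\beta}\right)\left(1 + \exp\left(-\dfrac{d^2(x_i, x_j)}{\beta}\right)\right), & \text{if } x_j \in N_k(x_i) \text{ or } x_i \in N_k(x_j), \text{ and } l_i = l_j, \\ \exp\left(-\dfrac{d^2(x_i, x_j)}{\beta}\right)\left(1 - \exp\left(-\dfrac{d^2(x_i, x_j)}{\beta}\right)\right), & \text{if } x_j \in N_k(x_i) \text{ or } x_i \in N_k(x_j), \text{ and } l_i \ne l_j, \\ 0, & \text{otherwise}, \end{cases}$$
where $d(x_i, x_j)$ denotes the geodesic distance between points $x_i$ and $x_j$, and $\beta$ is a parameter which is used as a regulator.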

Figure 1 shows that the typical plot ofis a function of, where,  ,   .

From Figure 1 we find thatis not a monotonically decreasing function of, which is due to the fact thatis a nonmonotonic function. When,is monotonically increasing. In the actual applications,should decrease with the increase of. Zhang et al. [23] proposed a modified ODP (MODP) with correlation coefficientbetweenand.

Obviously, the above weight definitions consider only the spatial structure and not the manifold structure. Meanwhile, a fixed set of $k$ nearest neighbors does not reflect the real manifold structure.

3. Video Feature Reduction

In this section, we extract more discriminating video features from the original color, shape, and texture features. Firstly, we propose an improved LPP, called ALPP. We adaptively select the nearest neighbors of each point and introduce the mode information into a new weight similarity, which makes the weight a monotonically decreasing function of distance. The major merit of ALPP is that it preserves the local structure and label information of the original data. Then we use ALPP to extract more discriminating video features for shot boundary recognition.

3.1. Mode Detection

In order to preserve the mode information of the data points, we detect modes with the median-shift method, which relies on computing the median of local neighborhoods instead of the mean. Because the median of a set is itself a point in the set, the method is more robust than the mean-shift method. Most importantly, median-shift is a nonparametric method that requires no prior knowledge of the number of clusters and places no limitations on the shape of the clusters. The process of mode detection is as follows [24].

Suppose we are given a set,  ,  ; define the Tukey depth of a point to be where the Tukey depth of a point is the minimum of its depth along any projection vector.

Firstly, the median ofis an element with maximal depth
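As a rough illustration of the depth-based median, the sketch below covers only the one-dimensional special case, which is an assumption made for brevity. In one dimension the Tukey depth of a point reduces to the smaller of the sample counts on either closed side of it, and the median is a sample point of maximal depth. The function names are hypothetical and are not taken from [24].

```python
def tukey_depth_1d(x, points):
    """Tukey (halfspace) depth of x in a 1-D sample.

    In one dimension there are only two directions, so the depth is the
    smaller of the number of points at or below x and at or above x.
    """
    at_or_below = sum(1 for p in points if p <= x)
    at_or_above = sum(1 for p in points if p >= x)
    return min(at_or_below, at_or_above)


def tukey_median_1d(points):
    """Return a sample point with maximal depth (first one on ties)."""
    return max(points, key=lambda p: tukey_depth_1d(p, points))


# Toy sample with one outlier; the median stays inside the bulk of the data.
sample = [1.0, 2.0, 2.5, 3.0, 50.0]
median = tukey_median_1d(sample)
```

Because the returned median is always one of the input samples, a single outlier such as 50.0 cannot drag it away from the data, which is the robustness argument made for median-shift over mean-shift.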

Then we seek the mode with the median-shift algorithms. For each point we wish to ascend in the direction of the positive gradient of the underlying probability density function. We define the median-shift for pointin setas whereis a bandwidth parameter. Sinceuses necessarily a point in the dataset, there is no need for multiple iterations in this step. After one iteration all points are linked and we can only go through the list of discovered medians to find a mode. The results of this step are a set of modes representing clusters.

Next, we proceed by iteratively working on the reduced set of modes, replacing the median calculation by weighted median calculation until convergence, where weights are the number of points mapped to the given mode. The weights are taken into account during the calculation of the depth of each point in the next iteration by modifying the definitions as follows: and definingas the total weights in the neighborhood of, then

Finally, in case of data clustering, and not only mode detection, we map each data point to its closest mode. Letbe the model of the point, and letbe the set ofmode; we can obtain where.

3.2. Adaptive Neighborhood Selection

Because a fixed $k$-nearest neighborhood does not reflect the mode information of the manifold structure, we apply an adaptive strategy to select the neighborhood of each point.

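Firstly, we define the manifold adjusted length of a line segment as
$$L(x_i, x_j) = \rho^{d(x_i, x_j)} - 1,$$
where $d(x_i, x_j)$ is the Euclidean distance between $x_i$ and $x_j$ and $\rho > 1$ is a flexing factor. Obviously, this formulation can be utilized to describe the global consistency. In addition, the length of the line segment between two points can be elongated or shortened by adjusting the flexing factor $\rho$ [25].

Then, let the data points be the nodes of a graph $G = (V, E)$ and let $p \in V^{l}$ be a path of length $l = |p|$ connecting the nodes $p_1$ and $p_{|p|}$, in which $(p_k, p_{k+1}) \in E$, $1 \le k < |p|$. Let $P_{ij}$ denote the set of all paths connecting nodes $x_i$ and $x_j$; the manifold distance metric between two points is defined as follows:
$$D(x_i, x_j) = \min_{p \in P_{ij}} \sum_{k=1}^{|p|-1} L(p_k, p_{k+1}),$$
where $L(p_k, p_{k+1})$ denotes the manifold adjusted length of the line segment.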
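The manifold distance is a shortest-path distance over adjusted segment lengths, which the minimal sketch below computes. It rests on two labeled assumptions: the adjusted length has the density-sensitive form `rho ** d - 1` with `rho > 1`, and the graph is complete (every pair of points is joined by an edge). The function names and the Floyd-Warshall solver are illustrative choices, not taken from the paper.

```python
import math


def adjusted_length(p, q, rho=2.0):
    """Manifold-adjusted segment length, assumed to be rho**d(p, q) - 1."""
    return rho ** math.dist(p, q) - 1.0


def manifold_distances(points, rho=2.0):
    """All-pairs shortest paths (Floyd-Warshall) over adjusted edge lengths.

    The graph is assumed complete: every pair of points is joined by an edge.
    """
    n = len(points)
    dist = [[adjusted_length(points[i], points[j], rho) for j in range(n)]
            for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


# Four evenly spaced points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
D = manifold_distances(points, rho=2.0)
```

With these inputs the direct segment between the two end points has adjusted length 2**3 - 1 = 7, while the path through the intermediate points costs 3 * (2**1 - 1) = 3. Short hops through dense regions are therefore favored over one long jump.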

Next, the average manifold distance of pointis defined as follows: whereis the total number which meets the conditions.

Lastly, the adaptive neighborhood of pointis constructed as which shows that the neighborhood of pointis adaptively built with the points where the distance is shorter than the average distance.

The major merits of the adaptive neighborhood selection method can be summarized as follows.
(1) The manifold distance metric measures the geodesic distance along the manifold, which elongates the distance between data points in different high-density regions and simultaneously shortens the distance between points in the same high-density region.
(2) The neighborhood of each point differs from the others and is decided by the local density of the original space. When the local density around a point is lower, its neighborhood is larger, and vice versa.
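The selection rule described above keeps, for each point, the points whose distance is below that point's average distance. The sketch below takes a precomputed distance matrix and assumes the average is taken over all other points, since the exact averaging condition is not recoverable from the text. The function name is hypothetical.

```python
def adaptive_neighborhoods(dist):
    """Select, for every point i, the points closer than i's mean distance.

    dist is a square matrix of pairwise (manifold) distances. The mean is
    taken over all j != i, which is an assumption of this sketch.
    """
    n = len(dist)
    neighborhoods = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        mean_dist = sum(dist[i][j] for j in others) / len(others)
        neighborhoods.append([j for j in others if dist[i][j] < mean_dist])
    return neighborhoods


# Toy distance matrix: points 0-2 form a tight cluster, point 3 lies far away.
dist = [
    [0.0, 1.0, 1.2, 9.0],
    [1.0, 0.0, 0.8, 9.5],
    [1.2, 0.8, 0.0, 8.7],
    [9.0, 9.5, 8.7, 0.0],
]
neighbors = adaptive_neighborhoods(dist)
```

No neighborhood size is fixed in advance: each point's threshold adapts to how far away the rest of the data lies from it.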

3.3. Improved Weight Definition

In order to resolve the problem which is described in Section 2.2, we improve the weight definition between two pointsandas follows: wheredenotes the distance between pointsand,   andare the regular parameters, andis the label of the pointfor definition ofplease refer to (16).

Figure 2 shows that the typical plot ofis a function of,   , where,  , and.

Similar to paper [26], letbe the local weight, and letbe the intermode discriminating weight. The new weight definition can be viewed as the local weight and discriminating weight. It means that the discriminating similarity reflects both the local neighborhood structure of model and label information of the data set.

The properties and the corresponding advantages of the improved weight definition can be summarized as follows.
(1) The improved weight definition makes use of the label information and mode information to preserve the manifold information, which is very important for classification.
(2) Since the value ofranges from 0 to 1, no matter how far apart the two points are, the inter-mode similarity is limited to a certain range.
(3) With the decrease of the geodesic distance,is a decrease, which means that two near points from different modes have a smaller similarity.
(4) Note thatandalways decrease whenandare far apart and they increase whenandare close. Thus,is a monotonically decreasing function of.
(5) The use of mode prior probabilities makes the newly designed discriminating similaritymore suitable for preserving the local structure information of the original data.

3.4. Video Feature Reduction

In order to enhance the discriminating information for shot boundary recognition, we combine the label information and the mode information to improve the discriminating ability while preserving the local neighborhood structure of the original data.

Due to introducing the similarity matrix, we define the local scatter matrix as follows [23]: where,  ,  .

Then we define the nonlocal scatter matrix as follows: where,  .

Lastly, the objective function of the improved LPP can be expressed as follows: whereis an adjustable factor.

So we can find thatconsists of the eigenvectors associated withtop eigenvalues of the following eigen-equation:

The algorithmic procedure of video feature extraction is stated below.(1)Extract original video feature including the color, shape and texture feature.(2)Perform PCA projection. In order to make the matrixbecome nonsingular, we project the dataset into a PCA subspace with a transformation matrix.(3)Define the similarity matrix. For each point, compute the similarity, ifis the adaptive neighbors ofand, and compute the similarity, ifis the adaptive neighbors ofand.(4)Compute the diagonal matrixand Laplacian matrixand then compute the topeigenvalues and its corresponding eigenvectors based on (20).(5)Perform the ALPP transformation. Letbe an optimal projection matrix; we can project the new data into low dimensionality with

4. Shot Boundary Recognition

The process of shot boundary detection includes two steps. Firstly, we extract the video features using ALPP method. Secondly, we detect shot boundary using an optimized MKSVM.

4.1. Multiple Kernel SVM

Support vector machines are a family of pattern classification algorithms based on the idea of structural risk minimization rather than empirical risk minimization [27]. However, it is often unclear which kernel is most suitable for the task at hand. Recently, multiple kernel learning has been used to combine different kernels by jointly optimizing the coefficients of the classifier and the weights of the kernels, which is more effective for object recognition than a single-kernel SVM [28]. In this paper, we combine several candidate kernels to improve the precision of shot boundary recognition.

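Let $\mu = (\mu_1, \mu_2, \ldots, \mu_M)$ be a vector of weights for the mixture of kernels. A multiple kernel is the combination of the $M$ basis kernels
$$K(x, x') = \sum_{m=1}^{M} \mu_m K_m(x, x').$$

Assume there is a data set $\{(x_i, y_i)\}_{i=1}^{n}$ of labeled examples, where $x_i$ is the input vector and $y_i \in \{-1, +1\}$.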
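A multiple kernel of this kind can be sketched as a weighted sum of basis kernels. The example below uses linear, polynomial, and radial basis kernels, the three kernel types used in the experiments. The weight values, the kernel hyperparameters (`degree`, `coef0`, `gamma`), and all function names are hypothetical. Nonnegative weights are enforced because a nonnegative combination of valid kernels is again a valid kernel.

```python
import math


def linear_kernel(x, z):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel (x . z + coef0) ** degree."""
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """Radial basis kernel exp(-gamma * ||x - z||^2)."""
    squared_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_dist)


def combined_kernel(x, z, weights, kernels):
    """Weighted sum of basis kernels evaluated at the pair (x, z)."""
    if len(weights) != len(kernels):
        raise ValueError("one weight is required per basis kernel")
    if any(w < 0 for w in weights):
        raise ValueError("kernel weights must be nonnegative")
    return sum(w * k(x, z) for w, k in zip(weights, kernels))


kernels = [linear_kernel, polynomial_kernel, rbf_kernel]
weights = [0.2, 0.3, 0.5]          # hypothetical weights, e.g. found by a search
x, z = (1.0, 0.0), (1.0, 0.0)
value = combined_kernel(x, z, weights, kernels)
```

For identical inputs the three basis kernels evaluate to 1, 4, and 1, so the combined value is 0.2 * 1 + 0.3 * 4 + 0.5 * 1 = 1.9. Varying `weights` changes how much each kernel contributes.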

According to paper [29], the primal form of multiple kernel support vector machine (MKSVM) is thus formulated as the following optimization problem:

Similar to the SVM, with the constraint on, the above minimization problem can thus be transformed into the following dual problem:

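For the test input $x$, the decision function of MKSVM can be computed as
$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i \sum_{m=1}^{M} \mu_m K_m(x_i, x) + b\right),$$
where $\alpha_i$ are the dual variables, $b$ is the bias, and $\mu_m$ is the weight of the $m$th basis kernel $K_m$.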

4.2. Ant Colony Optimization Method

The ant colony optimization method (ACO) is an optimization method inspired by the foraging behavior of ant colonies [30]. When ants walk between their nest and a food source, they mark the paths with a chemical termed pheromone, and shorter paths accumulate more and more pheromone [31]. In the method, an ant determines its transfer direction according to the amount of pheromone on each path. Firstly, every ant constructs a path from a start vertex to an end vertex. Then, when all ants reach the end vertex, the edges are marked with a pheromone quantity. Thus the colony can converge to the shortest path [32]. In this paper, we apply the ant colony optimization method to solve the weight optimization problem in MKSVM.

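Let $p_{ij}^{k}(t)$ be the probability that ant $k$ transfers from element $i$ to element $j$ at iteration $t$. $p_{ij}^{k}(t)$ can be defined as
$$p_{ij}^{k}(t) = \begin{cases} \dfrac{[\tau_{ij}(t)]^{\alpha} [\eta_{ij}]^{\beta}}{\sum_{s \in \mathrm{allowed}_k} [\tau_{is}(t)]^{\alpha} [\eta_{is}]^{\beta}}, & j \in \mathrm{allowed}_k, \\ 0, & \text{otherwise}, \end{cases} \tag{27}$$
where $\mathrm{allowed}_k$ is the set of elements which ant $k$ has not yet visited, $\tau_{ij}(t)$ is the pheromone quantity of path $(i, j)$, $\eta_{ij}$ is a heuristic measure of moving from element $i$ to element $j$, and $\alpha$ and $\beta$ are two parameters that control the relative importance of the information heuristic and expectation heuristic factors, respectively.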

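The pheromone trail on a path is deposited by all ants step by step. After one iteration, the pheromone quantity $\tau_{ij}$ associated with an edge joining elements $i$ and $j$ is updated according to the following formula:
$$\tau_{ij}(t+1) = (1 - \rho)\,\tau_{ij}(t) + \sum_{k} \Delta\tau_{ij}^{k}(t),$$
where $\rho$ is a pheromone evaporation loss coefficient and $\Delta\tau_{ij}^{k}(t)$ is the pheromone quantity deposited at iteration $t$ by ant $k$ on an edge joining elements $i$ and $j$. The $\Delta\tau_{ij}^{k}(t)$ is usually defined as
$$\Delta\tau_{ij}^{k}(t) = \begin{cases} Q / L_k, & \text{if ant } k \text{ passes the edge } (i, j), \\ 0, & \text{otherwise}, \end{cases}$$
where $Q$ is a constant and $L_k$ is the cost function of the $k$th ant.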
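The two ingredients described above, the transition rule and the pheromone update, can be sketched as follows. The sketch follows the classical ant system and makes several labeled assumptions: evaporation is written as multiplication by `1 - rho`, the deposit is `q / cost`, and edges are treated as directed. All function and parameter names are hypothetical, and the sketch does not reproduce the exact procedure of [33].

```python
def transition_probabilities(current, unvisited, pheromone, heuristic,
                             alpha=1.0, beta=2.0):
    """Probability of moving from `current` to each unvisited element.

    Each candidate j is scored by pheromone**alpha * heuristic**beta and the
    scores are normalized over the unvisited elements only.
    """
    scores = {
        j: (pheromone[current][j] ** alpha) * (heuristic[current][j] ** beta)
        for j in unvisited
    }
    total = sum(scores.values())
    return {j: score / total for j, score in scores.items()}


def update_pheromone(pheromone, ant_paths, ant_costs, rho=0.1, q=1.0):
    """Evaporate all trails, then let each ant deposit q / cost on its edges."""
    n = len(pheromone)
    for i in range(n):
        for j in range(n):
            pheromone[i][j] *= (1.0 - rho)       # evaporation (assumed form)
    for path, cost in zip(ant_paths, ant_costs):
        for i, j in zip(path, path[1:]):          # consecutive path elements
            pheromone[i][j] += q / cost           # cheaper paths deposit more
    return pheromone


pheromone = [[1.0] * 3 for _ in range(3)]
heuristic = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
probs = transition_probabilities(0, [1, 2], pheromone, heuristic)
pheromone = update_pheromone(pheromone, ant_paths=[[0, 2, 1]], ant_costs=[2.0])
```

In this toy run the move 0 to 2 is four times as likely as 0 to 1, because its heuristic value is doubled and `beta` is 2. After the update, the two edges used by the single ant hold more pheromone than the unused ones.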

For the process of ant colony optimization method please refer to paper [33].

4.3. Shot Boundary Recognition

The process of shot boundary recognition includes the following stages. Firstly, we extract the original features. Then we reduce the video features with the improved LPP. Lastly, we classify each frame as a boundary or nonboundary frame with the optimized MKSVM classifier based on the ACO method.

The algorithm of video boundary recognition is described as follows.(1)Extract original video feature which includes colors, shape, and texture feature.(2)Reduce video feature with improved LPP.(3)Define input vector, whereis the video feature vector andisth frame of video.(4)Label the boundary framesand nonboundary frameand build training sample set.(5)Initialize the parameters of ACO. Let time, interactive times. Set the maximum iterative timesand let initial pheromoneand.(6)Updatebyand choose the ant elementas its transform direction according to function (27).(7)Update the taboo table pointer. Move the ant to the selected new element and add the element into ant taboo table.(8)If all elements of the set have been fully traversed, go to step 7 or else go to the next step.(9)Recalculate the pheromone of each path, if, go to step 6, or else save the weights.(10)Classify the video frame into boundary frame based on the following rule:

5. Experiments and Analysis

In this section we present experiments to validate the proposed approach. Firstly, we investigate the performance of the proposed ALPP method in a video feature extraction experiment. Then we recognize shot boundaries with the proposed method. The video database includes movie, news, sports, documentary, and MTV videos. These types were selected because news videos have many long abrupt shots, MTV has fast changes of scenes, sports videos have fast camera movements and zooming, and movies include many gradual shots. Some shot samples are shown in Figure 3. For evaluation, we use the standard figures of merit, precision and recall [34]:
$$\text{Precision} = \frac{N_c}{N_c + N_f}, \qquad \text{Recall} = \frac{N_c}{N_c + N_m},$$
where $N_c$, $N_f$, and $N_m$ are the numbers of correctly detected, falsely detected, and missed shot boundaries, respectively.
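As a small worked example of these two measures, the sketch below computes precision and recall from a list of detected boundary frames and a ground-truth list. The frame indices are invented for illustration and are not taken from the experiments, and matching by exact frame index is a simplifying assumption.

```python
def precision_recall(detected, ground_truth):
    """Precision and recall of detected boundary frames.

    A detection counts as correct only if its frame index appears in the
    ground truth (exact matching is an assumption of this sketch).
    """
    detected, ground_truth = set(detected), set(ground_truth)
    correct = len(detected & ground_truth)
    false_alarms = len(detected - ground_truth)
    missed = len(ground_truth - detected)
    precision = correct / (correct + false_alarms) if detected else 0.0
    recall = correct / (correct + missed) if ground_truth else 0.0
    return precision, recall


detected = [120, 348, 512, 790]          # hypothetical detector output
truth = [120, 348, 600, 790, 901]        # hypothetical annotated boundaries
precision, recall = precision_recall(detected, truth)
```

Here three detections are correct, one is a false alarm, and two boundaries are missed, giving a precision of 3/4 and a recall of 3/5.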

5.1. Video Feature Extraction Experiment

In order to verify the effectiveness of the adaptive LPP (ALPP), we extract video features with the ODP, MODP, and ALPP methods. For convenience of comparison, we use the same method as in paper [35] to detect shot boundaries. In the experiment, the original video feature is built on color, shape, and texture features. Then the ODP, MODP, and ALPP methods are used to extract video features withand. In the ODP and MODP methods we adopt the $k$-nearest-neighbor criterion to define the adjacency matrix, with $k$ set to 10. The results of the performance comparison of ODP, MODP, and ALPP are shown in Table 1.

From Table 1, we can find that ALPP obtains comparable recognition performance to LPP and ODP with the same shot boundary detection method. In the ALPP method, the improved weight definition combines the label information and the mode information with an adaptive nearest neighbor selection strategy, which is very important for reflecting the data faithfully. The experiment shows that ALPP extracts more useful and discriminating video features than the other methods.

5.2. Shot Boundary Recognition Experiment

In order to investigate the performance of the proposed method, we recognize different video shots, especially gradual shots. The system performance is compared with the multilayer perceptron network method (MPN method) [36]. We use the original video feature and set the parametersandto be the same as before. In the proposed method, we use a polynomial kernel, a radial basis kernel, and a linear kernel to build the MKSVM and set,  ,  ,  ,  , and. The experimental results are summarized in Figure 4.

From Figure 4, we find that both methods detect abrupt cuts as well as gradual shots very well, but the proposed method achieves better shot boundary recognition performance than the MPN method. The average precision and recall of the proposed method reach 94.1% and 91.7%, which are higher by 3.5% and 3.1%, respectively, than those of the MPN method. These results demonstrate that the proposed method, based on the optimized MKSVM, is a good tool for shot boundary recognition.

5.3. Discussion

Two experiments on different types of video have been systematically performed, from which we can conclude the following.
(1) We improve shot boundary detection in two stages. In the feature extraction stage, we use ALPP to extract more useful and discriminating video features. In the shot boundary detection stage, we use the optimized MKSVM to obtain higher boundary detection accuracy.
(2) For feature extraction, the proposed ALPP performed better than LPP and ODP. This is because ALPP makes use of the label information and the mode information together with an adaptive nearest neighbor selection strategy. At the same time, the improved weight definition guarantees that two near points from different modes have a smaller similarity.
(3) Compared with the MPN method, the proposed method yields better performance on shot boundary recognition. This is largely due to the optimized MKSVM, in which the kernel weights are optimized by the ant colony method. It should be noted that both methods produce some false detections, which may be due to irregular object movement and small content changes between consecutive frames.

6. Conclusion

In this paper, we present a new video shot boundary recognition method, which focuses on two key problems: extracting more useful and discriminating features and improving the accuracy of the shot boundary classifier. The major contribution of the paper is an improved locality preserving projection method with mode detection and an adaptive neighbor selection strategy. In addition, an optimized shot boundary classifier based on MKSVM is designed with the ant colony optimization method. In the experiments, the proposed method reaches an average precision of 94.1% and an average recall of 91.7%, outperforming the MPN method. Future work is to optimize the other parameters of MKSVM to achieve better results.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This project is supported by the Postdoctoral Science Foundation of Central South University, Scientific Research Fund of Hunan University of Finance and Economics under Grant (no. K201205), the Construct Program of the Key Discipline in Hunan Province, Hunan Province Education and Science Issue “Performance Evaluation for College Teacher Based on Adaptive Learning” (no. XJK013CGD083), and the Research Foundation of Science & Technology Office of Hunan Province under Grant (no. 2012GK3064).