Video Genre Classification Using Weighted Kernel Logistic Regression
As the semantic gap in video content widens, computational tools that classify videos into genres are needed to narrow it. Accurate classification demands a good representation of the video data and an efficient, effective classification model. Kernel Logistic Regression (KLR), the kernel version of logistic regression (LR), has proved efficient as a classifier; it naturally provides probabilities and extends to multiclass classification problems. In this paper, the Weighted Kernel Logistic Regression (WKLR) algorithm is implemented for video genre classification, and the results show that it is both accurate and fast.
1. Introduction
Video usage has grown in importance with recent advances in internet technology and digital media, and people now have access to huge amounts of data through the internet and television. It is difficult to find videos of interest in such volumes of data and to use them when needed, and it is not feasible to watch every video in search of the right one. Content-based analysis of video footage is therefore needed in many applications, for example, retrieving video sequences, creating automatic video summaries, or detecting specific actions or activities in video surveillance. Much work has addressed the video classification problem by assigning videos to categories or genres, which bridges the wide semantic gap between computed low-level features and high-level concepts and helps people find videos of interest within a narrow domain. To understand video content well, many techniques have been developed and many video features have been identified for better video representation. Techniques used for video classification include Bayesian classifiers, Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), Neural Networks (NN), and Support Vector Machines (SVM). SVM is considered a state-of-the-art algorithm for binary classification through its use of kernels.
Oh et al. developed a general framework for the fundamental tasks of video data mining: temporal segmentation of video sequences, feature extraction, capturing the location of motion in a segment, clustering the segmented pieces, and determining whether a segment contains normal or abnormal events. Chen et al. proposed a multimedia data mining framework for extracting soccer goal events from soccer videos by combining multimodal analysis with a decision tree. Li et al. also proposed a general framework for indexing and summarizing sports broadcast programs, with a high-level model of sports broadcast video built on the concept of an event, defined according to domain-specific knowledge for different types of sports. Zhong and Chang presented a general framework and effective algorithms to detect syntactic structures at a level higher than shots. Fan et al. proposed a framework for more effective semantic video classification that integrates unlabeled samples with a limited number of labeled samples.
Petrushin described an approach to efficient search for rare and frequent video events and built a robust GMM classifier to identify incoming events in real time for indoor surveillance. Hervieu et al. proposed a trajectory-based HMM framework for video content understanding that solves three tasks: unsupervised clustering of events, recognition of events corresponding to learned classes of dynamic video content, and detection of unexpected events. Adam et al. presented a novel algorithm based on multiple local monitors that collect low-level statistics to detect certain types of unusual events. Shirahama et al. developed a method for retrieving events of interest in a video archive, using rough set theory to extract multiple classification rules, each of which correctly identifies a subset of shots of the event.
Maalouf et al. [10, 11] show that Kernel Logistic Regression (KLR), the kernel version of logistic regression (LR), is an efficient classifier that naturally provides probabilities and extends to multiclass classification problems. KLR is used for several reasons: LR and KLR are well studied, they make no assumption about the distribution of the independent variables, and they can be extended to handle multiclass classification problems. Moreover, with the right algorithm, the computation time can be lower than that of other methods such as Support Vector Machine (SVM), which requires solving a constrained quadratic optimization problem.
This paper implements the Weighted Kernel Logistic Regression (WKLR) algorithm for video genre classification because it has direct probabilistic output and, unlike SVM, extends directly to multiclass classification. The rest of the paper is organized as follows: Section 2 explains the Logistic Regression and Kernel Logistic Regression models, Section 3 describes video feature extraction, Section 4 presents the proposed method, Section 5 reports the experimental results, and Section 6 concludes the paper.
2. Logistic Regression and Kernel Logistic Regression Model
2.1. Logistic Regression
In logistic regression, a single outcome variable $y_i$ ($i = 1, \ldots, n$) is coded 1 with probability $p_i$ and 0 with probability $1 - p_i$. Then $p_i$ varies as a function of some explanatory variables $x_i$; mathematically, it is expressed by the following formula (Maalouf et al. and King and Zeng [10, 12]):
$$p_i = \frac{e^{x_i\beta}}{1 + e^{x_i\beta}} = \frac{1}{1 + e^{-x_i\beta}},$$
where $\beta$ is a vector of parameters, with the assumption that $x_{i0} = 1$ so that the intercept $\beta_0$ is a constant term. From now on, the intercept is assumed to be included in the vector $\beta$. The logistic (logit) transformation is the logarithm of the odds of the positive response and is defined as
$$\eta_i = \ln\left(\frac{p_i}{1 - p_i}\right) = x_i\beta.$$
The logit function can be expressed in matrix form as
$$\eta = X\beta.$$
The regularized log likelihood is defined as
$$\ln L(\beta) = \sum_{i=1}^{n}\left(y_i \ln p_i + (1 - y_i)\ln(1 - p_i)\right) - \frac{\lambda}{2}\|\beta\|^2,$$
where the regularization (penalty) term $\frac{\lambda}{2}\|\beta\|^2$ is added to obtain better generalization.
For binary outputs, the loss function, or deviance DEV, is twice the negative log likelihood and is given by the formula
$$\mathrm{DEV}(\beta) = -2\ln L(\beta).$$
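The quantities in this subsection can be checked numerically with a short sketch. The function names, the toy data, and the value of `lam` below are illustrative assumptions, not part of the paper; the deviance is taken here as minus twice the regularized log likelihood.

```python
import math

def sigmoid(eta):
    """Logistic function p = 1 / (1 + exp(-eta)), arranged to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def logit(p):
    """Log-odds of the positive response; inverse of sigmoid for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def regularized_log_likelihood(X, y, beta, lam):
    """Penalized log likelihood; each row of X starts with 1 for the intercept."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))
        # y*ln(p) + (1-y)*ln(1-p) simplifies to y*eta - ln(1 + exp(eta)).
        log1p_exp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += y_i * eta - log1p_exp
    penalty = 0.5 * lam * sum(b * b for b in beta)
    return total - penalty

def deviance(X, y, beta, lam):
    """Deviance taken as -2 times the regularized log likelihood."""
    return -2.0 * regularized_log_likelihood(X, y, beta, lam)

# Toy data (made up for illustration): intercept column plus one feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0]]
y = [1, 1, 0, 0]
dev_zero = deviance(X, y, [0.0, 0.0], lam=1.0)   # every p_i equals 0.5
dev_fit = deviance(X, y, [0.0, 1.0], lam=1.0)    # a slope that separates the classes
print(dev_zero, dev_fit)
```

On this toy data the second parameter vector gives the lower deviance, which is the behaviour a fitting procedure exploits when it searches for the maximum likelihood estimate.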
2.2. Kernel Logistic Regression
For kernel logistic regression, following Maalouf et al., the vector $\beta$ can be expressed as a linear combination of the input vectors:
$$\beta = X^T\alpha = \sum_{i=1}^{n}\alpha_i x_i^T,$$
where the vector $\alpha$ is known as the dual variable, with dimensions $n \times 1$. The logit vector can now be rewritten as
$$\eta = XX^T\alpha = K\alpha,$$
where $K = XX^T$ is the symmetric positive semidefinite Gram matrix with dimensions $n \times n$.
Consider again the logit link function shown previously, $\eta_i = x_i\beta$, where the vector $x_i$ is given by $x_i = (x_{i1}, \ldots, x_{id})$, with $i = 1, \ldots, n$. This linear function represents the simplest form of a polynomial basis function, the identity mapping $\phi$ of the feature space, such that
$$\phi(x_i) = x_i.$$
So the logit link function can be rewritten as
$$\eta_i = \phi(x_i)\beta.$$
Generally, the function $\phi$ maps the data from a lower-dimensional space to a higher-dimensional space, so that
$$\phi : x \in \mathbb{R}^d \rightarrow \phi(x) \in F \subseteq \mathbb{R}^{\Lambda}.$$
The purpose of choosing the mapping $\phi$ is to convert a nonlinear relation between the response variable and the independent variables into a linear relation (Maalouf and Trafalis).
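The logit link function in the case of KLR can then be expressed as
$$\eta_i = \sum_{j=1}^{n}\alpha_j\,\kappa(x_i, x_j) = k_i\alpha,$$
where $k_i$ is the $i$th row of the kernel matrix $K$, with entries $K_{ij} = \kappa(x_i, x_j) = \phi(x_i)\phi(x_j)^T$. Mercer's necessary and sufficient conditions must be satisfied by the kernel function: it must be expressible as an inner product and must be positive semidefinite (Maalouf and Trafalis). Now
$$\beta = \sum_{i=1}^{n}\alpha_i\,\phi(x_i)^T.$$
This implies that
$$\|\beta\|^2 = \alpha^T K\alpha, \qquad p_i = \frac{1}{1 + e^{-k_i\alpha}}.$$
So the regularized log likelihood can be rewritten with respect to $\alpha$ as
$$\ln L(\alpha) = \sum_{i=1}^{n}\left(y_i \ln p_i + (1 - y_i)\ln(1 - p_i)\right) - \frac{\lambda}{2}\alpha^T K\alpha,$$
with deviance
$$\mathrm{DEV}(\alpha) = -2\ln L(\alpha).$$
The iteratively reweighted least squares (IRLS) method is one of the most popular techniques used to find the MLE of $\beta$ in LR models; it uses the Newton-Raphson algorithm to solve the score equations. KLR models can also be fitted using IRLS. Each iteration finds the weighted least squares (WLS) estimates for a given set of weights, which are used to construct a new set of weights (Oh et al.). The gradient and Hessian are obtained by differentiating $\ln L(\alpha)$ with respect to $\alpha$. In matrix form, the gradient is
$$\nabla_{\alpha}\ln L(\alpha) = K^T(y - p) - \lambda K\alpha,$$
where $p$ is the probability vector whose elements are the $p_i$ given above. The Hessian with respect to $\alpha$ is
$$\nabla^2_{\alpha}\ln L(\alpha) = -K^T V K - \lambda K,$$
where $V$ is a diagonal matrix with diagonal elements $v_{ii} = p_i(1 - p_i)$ for $i = 1, \ldots, n$. The Newton-Raphson update with respect to $\alpha$ on the $(c+1)$-th iteration is
$$\alpha^{(c+1)} = \alpha^{(c)} + \left(K^T V K + \lambda K\right)^{-1}\left(K^T(y - p) - \lambda K\alpha^{(c)}\right).$$
Since $\alpha^{(c)} = \left(K^T V K + \lambda K\right)^{-1}\left(K^T V K + \lambda K\right)\alpha^{(c)}$, the update can be rewritten as
$$\alpha^{(c+1)} = \left(K^T V K + \lambda K\right)^{-1}K^T V z^{(c)},$$
where $z^{(c)} = K\alpha^{(c)} + V^{-1}(y - p)$ is the adjusted dependent variable or adjusted response.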
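As a small illustration of the kernel form of the logit, the sketch below builds a Gram matrix with a Gaussian kernel and evaluates the KLR probabilities from a dual vector. The kernel width `sigma`, the three sample points, and the dual values are assumptions chosen only to make the example concrete.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors; sigma is an assumed width."""
    sq_dist = sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(X, sigma=1.0):
    """n x n kernel matrix K with entries K[i][j] = kernel(x_i, x_j)."""
    return [[gaussian_kernel(x_i, x_j, sigma) for x_j in X] for x_i in X]

def klr_probabilities(K, alpha):
    """p_i = 1 / (1 + exp(-k_i . alpha)), where k_i is the i-th row of K."""
    probs = []
    for k_i in K:
        eta_i = sum(k_ij * a_j for k_ij, a_j in zip(k_i, alpha))
        probs.append(1.0 / (1.0 + math.exp(-eta_i)))
    return probs

# Three made-up points: two close together, one far away.
X = [[0.0, 0.0], [0.1, 0.2], [3.0, 3.0]]
K = gram_matrix(X)
alpha = [-1.0, -1.0, 2.0]          # hypothetical dual variables
p = klr_probabilities(K, alpha)
print(p)
```

The Gram matrix is symmetric with ones on the diagonal, and each probability depends on the other samples only through the corresponding kernel row.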
2.3. WKLR Algorithm
Consider
$$\left(K^T V K + \lambda K\right)\alpha^{(c+1)} = K^T V z^{(c)},$$
which is a system of linear equations with kernel matrix $K$, vector of adjusted responses $z$, and weight matrix $V$. The weights and the adjusted response vector both depend on $\alpha^{(c)}$, the current estimate of the parameter vector. Therefore, after specifying an initial estimate $\alpha^{(0)}$, the system can be solved iteratively using the Conjugate Gradient (CG) method, which is equivalent to minimizing the quadratic problem
$$\frac{1}{2}\alpha^{T}\left(K^T V K + \lambda K\right)\alpha - \alpha^{T}K^T V z.$$
To avoid the long computations that CG may suffer from, a limit can be placed on the number of CG iterations, thus creating an approximate or truncated Newton direction (Maalouf and Trafalis).
Algorithm 1 describes Iterative Weighted Kernel Logistic Regression Maximum Likelihood Estimate (IWKLR MLE) using IRLS.
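The following is a minimal sketch of an IRLS loop with a truncated CG inner solver, written from the description in this section. It is not a transcription of Algorithm 1. The Gaussian kernel width, the value of `lam`, the tolerances, the function names, and the toy data are all assumptions, and NumPy is assumed to be available.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gaussian-kernel Gram matrix for the rows of X (sigma is an assumed width)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def truncated_cg(A, b, x0, max_iter=30, tol=1e-10):
    """Approximately solve A x = b (A symmetric positive definite) in at most max_iter steps."""
    x = x0.copy()
    r = b - A @ x
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ad = A @ d
        step = rs / (d @ Ad)
        x = x + step * d
        r = r - step * Ad
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x

def penalized_deviance(K, y, alpha, lam):
    """-2 times the regularized log likelihood in the dual variable alpha."""
    eta = K @ alpha
    log_lik = np.sum(y * eta - np.logaddexp(0.0, eta))
    return -2.0 * (log_lik - 0.5 * lam * alpha @ K @ alpha)

def fit_klr_irls(K, y, lam=0.1, max_irls=30, max_cg=30, tol=1e-6):
    """IRLS: each pass solves (K V K + lam K) alpha = K V z with truncated CG."""
    alpha = np.zeros(K.shape[0])
    history = []
    for _ in range(max_irls):
        eta = K @ alpha
        p = np.clip(1.0 / (1.0 + np.exp(-eta)), 1e-10, 1.0 - 1e-10)
        v = p * (1.0 - p)                      # diagonal of the weight matrix V
        z = eta + (y - p) / v                  # adjusted response
        A = K @ (v[:, None] * K) + lam * K
        b = K @ (v * z)
        alpha = truncated_cg(A, b, alpha, max_iter=max_cg)
        history.append(penalized_deviance(K, y, alpha, lam))
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol * abs(history[-1]):
            break
    return alpha, history

# Toy two-class data (made up): two Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (10, 2)), rng.normal(3.0, 0.5, (10, 2))])
y = np.concatenate([np.zeros(10), np.ones(10)])
K = rbf_gram(X)
alpha, history = fit_klr_irls(K, y)
accuracy = np.mean((K @ alpha > 0).astype(float) == y)
print("training accuracy:", accuracy)
```

Capping `max_cg` is what makes each Newton direction approximate: the inner solver stops early instead of solving the linear system to full precision. The kernel matrix is symmetric, so the code writes `K` where the text writes $K^T$.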
3. Video Feature Extraction
Extracting appropriate features is vital to the design of any pattern classifier. In video classification studies, features fall into three categories: text features, audio features, and visual features. Some studies instead divide features into two main categories, text features and nontext features, and divide the latter into low-level and semantic video features. Text features comprise text extracted from the video and user-generated text. Nontext features are extracted from images, audio, and motion; they can be referred to as low-level features, defined as features extracted from video clips and audio tracks without reference to any external knowledge [13, 14]. Some studies use visual features that correspond to cinematic principles, composed of color, motion, and average shot length. One difficulty with low-level features, especially visual features, is the huge amount of data; the most common solutions are to represent each shot by a key frame or to apply dimensionality reduction techniques such as Principal Component Analysis (PCA) and wavelet transforms [13, 15, 16]. Discrete transforms are widely used for feature extraction and data redundancy reduction, and the relative advantages of deterministic transforms make them an attractive family of feature extraction approaches. One important discrete transform is the Discrete Cosine Transform (DCT), whose special properties make it a powerful tool in video processing applications [17–20]: it decorrelates data, and fast algorithms exist for computing it. When DCT is applied to video, some coefficients are selected and the others are discarded to reduce the dimension of the data, so DCT coefficient selection is an important part of the feature extraction process.
DCT feature extraction is composed of two steps: first, DCT is applied to the entire key frame to obtain the DCT coefficients; second, some of the coefficients are selected to construct the feature vectors. The DCT coefficient matrix has the same size as the input frame. DCT by itself does not decrease the data dimension; rather, it compresses most of the signal information into a small percentage of the coefficients. The DCT coefficients for an $M \times N$ frame are calculated as follows:
$$F(u, v) = \frac{1}{\sqrt{MN}}\,\alpha(u)\,\alpha(v)\sum_{x=0}^{M-1}\sum_{y=0}^{N-1} f(x, y)\cos\frac{(2x + 1)u\pi}{2M}\cos\frac{(2y + 1)v\pi}{2N},$$
where $\alpha(\omega)$ is defined by
$$\alpha(\omega) = \begin{cases} 1, & \omega = 0, \\ \sqrt{2}, & \text{otherwise}, \end{cases}$$
$f(x, y)$ is the frame intensity function, and $F(u, v)$ is the 2D matrix of DCT coefficients. Because the DCT coefficient matrix is the same size as the frame, PCA can be used to obtain a feasible and compact low-dimensional representation of the features. Given $N$ $d$-dimensional feature vectors $x_1, \ldots, x_N$, the mean vector is calculated as
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,$$
and the covariance matrix is calculated as
$$C = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)(x_i - \mu)^T.$$
The principal components are then the first $m$ significant eigenvectors of $C$, that is, $e_1, \ldots, e_m$. By constructing the eigenmatrix $E = [e_1, \ldots, e_m]$ of dimension $d \times m$, an arbitrary $d$-dimensional original feature vector $x$ can be represented as a new low $m$-dimensional vector $y$, with $y = E^T(x - \mu)$ and $m \ll d$.
In our proposed method we work with the shot or scene concept and take a key frame to represent the whole shot or scene. Different algorithms exist for extracting key frames; most of them rely on motion. After extracting the key frames, we apply DCT to obtain the DCT coefficients and then apply PCA to the DCT coefficients to select the most significant ones.
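The two-step feature pipeline described here (2D DCT of each key frame, then PCA on the coefficients) can be sketched as follows. The orthonormal DCT-II normalization, the helper names, the frame size, and the random stand-in frames are assumptions for illustration; NumPy is assumed to be available.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: row u holds the cosine of frequency u."""
    u = np.arange(n)[:, None]          # frequency index
    x = np.arange(n)[None, :]          # spatial index
    D = np.sqrt(2.0 / n) * np.cos((2 * x + 1) * u * np.pi / (2 * n))
    D[0, :] = np.sqrt(1.0 / n)         # the zero-frequency row uses the smaller scale factor
    return D

def dct2(frame):
    """2D DCT of an M x N frame; the result has the same shape as the frame."""
    M, N = frame.shape
    return dct_matrix(M) @ frame @ dct_matrix(N).T

def pca_fit(features, m):
    """Mean vector and the m leading eigenvectors of the covariance matrix."""
    mu = features.mean(axis=0)
    centered = features - mu
    C = centered.T @ centered / features.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]
    return mu, eigvecs[:, order]

def pca_transform(features, mu, E):
    """Project centered feature vectors onto the columns of E."""
    return (features - mu) @ E

rng = np.random.default_rng(0)
key_frames = rng.random((6, 8, 8))                # six hypothetical 8x8 key frames
features = np.stack([dct2(f).ravel() for f in key_frames])
mu, E = pca_fit(features, m=3)
reduced = pca_transform(features, mu, E)
print(reduced.shape)                              # (6, 3)
```

For a constant frame all of the energy lands in the single coefficient at `(0, 0)`, which illustrates how DCT concentrates information in few coefficients before PCA reduces the dimension.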
4. Proposed Method
In this paper, shot detection is first used to extract the key frames from the input videos; shot detection is used because it is a more sophisticated method for video summarization [19, 22]. Second, we construct the $X$ and $y$ variables: $X$ represents the features of each video shot, given by the DCT_PCA data, and $y$ indicates the class of the video shot, which is set manually. Each feature is then scaled to values between 0 and 1 using the following equation:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$
where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of that feature. One advantage of this scaling is that features with greater numeric ranges do not dominate those with smaller ranges. Another is that it avoids the numerical difficulties that large feature values can cause during calculation. Finally, WKLR is applied to the prepared data for classification, with the aim of achieving high accuracy and establishing WKLR as an effective method for video classification.
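The scaling step can be illustrated with a short sketch of per-feature min-max scaling. The function name, the sample values, and the choice to map a constant feature to 0 are assumptions.

```python
def min_max_scale(rows):
    """Scale each feature (column) of a list of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant feature has hi == lo; mapping it to 0 is an assumption.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled

features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = min_max_scale(features)
print(scaled)   # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Both columns end up on the same scale even though the second one is ten times larger in the raw data, which is the effect the scaling is meant to achieve.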
4.1. Performance Measures
Prediction quality is evaluated using five measures: Matthews correlation coefficient (MCC), accuracy, Predicted Positive Value (PPV), sensitivity, and specificity. In the formulas below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives. The Matthews correlation coefficient ranges from −1 to 1, where 1 is perfect correlation and −1 is perfect anticorrelation; a value of 0 indicates no correlation. Its value can be calculated as follows:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$
Accuracy is the percentage of correctly classified samples, also called the prediction accuracy. It is given as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
PPV is the predicted positive value, also called the precision. It is given as follows:
$$\mathrm{PPV} = \frac{TP}{TP + FP}.$$
Sensitivity, also called recall, is the fraction of the total positive examples that are correctly predicted, and it is calculated as follows:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}.$$
Specificity is the fraction of the total negative examples that are correctly predicted; its value can be calculated as follows:
$$\mathrm{Specificity} = \frac{TN}{TN + FP}.$$
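The five measures can be computed from confusion-matrix counts as in the sketch below. The function name, the zero-denominator fallbacks, and the example counts are assumptions; for the three-genre problem each genre would be treated in turn as the positive class.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """MCC, accuracy, PPV, sensitivity, and specificity from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        # Returning 0.0 when a denominator is zero is an assumed convention.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "ppv": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Hypothetical counts for one genre treated as the positive class.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```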
5. Experimental Results
The steps described above were tested on a group of videos downloaded from the YouTube and Youku websites, covering three categories: News, Sport, and Movies. About 10000 frames were selected from these videos to test the accuracy of the proposed method. In the training phase, the data were distributed randomly, the WKLR MLE algorithm was run with 10-fold cross-validation, and the maximum number of iterations for the truncated Newton method was set to 30.
Figure: ROC curves for the three genres: (a) News, (b) Sport, (c) Movie.
As shown in Table 2, KLR with the same Gaussian kernel achieves a considerable improvement in performance on small to medium size data sets: the average accuracy of the proposed method is 88%, compared with 86% for the work of Rim et al., which proposed a new technique for improving unconstrained face recognition by leveraging weakly labeled web videos. KLR also has the advantages that it extends directly to multiclass classification and yields probabilistic output without intervention.
6. Conclusion
This paper proposed Weighted Kernel Logistic Regression (WKLR) to improve the accuracy and processing time of video classification. The method is easy to implement and gives good performance. One benefit of WKLR is that it relies on unconstrained optimization, whose algorithms are less complex than the constrained optimization algorithms required by methods such as SVM. Another benefit is that WKLR combines the speed of IRLS with the power of kernel methods. Future studies may compare this method using different kernels, test the algorithm on more datasets, and try to improve its speed. Methods such as the trust region Newton method might also be used because of their stability and rigorous mathematical derivation.
References
A. Hervieu, P. Bouthemy, and J. P. L. Cadre, “Video event classification and detection using 2D trajectories,” in Proceedings of the 3rd International Conference on Computer Vision Theory and Applications (VISAPP '08), pp. 158–166, January 2008.
J. H. Oh, J. Lee, and S. Kote, “Multimedia data mining framework for raw video sequences,” in Mining Multimedia and Complex Data, pp. 18–35, 2003.
S. C. Chen, M.-L. Shyu, M. Chen, and C. Zhang, “A decision tree-based multimodal data mining framework for soccer goal detection,” in Proceedings of IEEE International Conference on Multimedia and Expo (ICME), pp. 265–268, IEEE, June 2004.
J. Fan, H. Luo, J. Xiao, and L. Wu, “Semantic video classification and feature subset selection under context and concept uncertainty,” in Proceedings of the 4th ACM/IEEE Joint Conference on Digital Libraries; Global Reach and Diverse Impact (JCDL '04), pp. 192–201, IEEE, June 2004.
K. Shirahama, Y. Matsuoka, and K. Uehara, “Video event retrieval from a small number of examples using rough set theory,” in Advances in Multimedia Modeling, pp. 96–106, 2011.
A. Girgensohn and J. Foote, “Video classification using transform coefficients,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), pp. 3045–3048, IEEE, March 1999.
C. A. Dhawale and S. Jain, “A novel approach towards keyframe selection for video summarization,” Asian Journal of Information Technology, vol. 7, no. 4, pp. 133–137, 2008.
M. Heckmann, K. Kroschel, C. Savariaux et al., “DCT-based video features for audio-visual speech recognition,” 2002.
M. Mentzelopoulos and A. Psarrou, “Key-frame extraction algorithm using entropy difference,” in Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR '04), pp. 39–45, ACM, October 2004.
D. Rim, K. Hassan, and C. Pal, “Semi supervised learning in wild faces and videos,” in Proceedings of the British Machine Vision Conference, J. Hoey, S. McKenna, and E. Trucco, Eds., pp. 3.1–3.12, September 2011.