Abstract

Video event detection is a challenging problem in many applications, such as video surveillance and video content analysis. In this paper, we propose a new framework that perceives high-level codewords by analyzing the temporal relationships between different channels of video features. Low-level vocabulary words are first generated from the extracted audio and visual features. A weighted undirected graph is then constructed by exploring the Granger Causality between the low-level words, and a greedy agglomerative graph-partitioning method is used to discover groups of low-level words that share similar temporal patterns. The high-level codebook representation is obtained by quantizing these low-level word groups. Finally, multiple kernel learning, combined with our high-level codewords, is used to detect video events. Extensive experimental results show that the proposed method achieves preferable results in video event detection.

1. Introduction

With the increasing popularity of digital cameras and mobile phones, more and more consumer-generated web videos recording real-life events are widely available on the Internet. For example, more than 100 hours of video are uploaded to YouTube every minute [1]. Consequently, how to effectively manage and retrieve unconstrained consumer videos is becoming an urgent problem. In particular, video event recognition is receiving increasing attention in the field of computer vision [2]. However, it is an extremely difficult task due to the diverse video content and variable conditions in lighting, camera motion, and occlusion. Figure 1 shows some representative frames from the event “bird” defined in the Columbia Consumer Video (CCV) Database [3]. We can see that the contents of these six videos are dramatically different, although they all belong to the same type of event.

The majority of existing event-recognition methods classify videos mainly based on visual information. In general, various visual features of key frames are extracted for event classification [4, 5]. Some other event detection methods use high-level visual feature representations which model the relationship between low-level visual features and semantic concepts [6, 7]. In fact, besides visual features, the audio information of the same video also provides important cues for event recognition [8, 9].

To better describe the underlying causality in videos, in this work, we propose a high-level codebook representation utilizing the Granger Causality [10] between different channels of features. First, low-level visual and audio features are extracted and clustered to form a visual bag-of-words (BoW) and an audio BoW, respectively. To model the temporal causality between the two channels of information, the vocabulary representation of a video sequence is viewed as the instantiation of a multivariate point process. By analyzing the Granger Causality between low-level audio and visual words, an undirected weighted graph is constructed to model the temporal causality of the videos. After that, we split the graph into low-level word groups which indicate the temporal patterns in the videos. Finally, high-level codebooks are generated by quantizing the low-level word groups, and video events are detected within a multiple kernel learning (MKL) framework [11]. We evaluate our method on public datasets and compare it with a number of other state-of-the-art methods. The experimental results illustrate that the proposed method achieves preferable results on consumer videos. The general flowchart of our proposed method is shown in Figure 2.

In summary, the main contributions of this paper are (1) a framework to perceive high-level codewords by taking the temporal causality between low-level features into account; (2) the construction of a temporal relationship graph to extract high-level codewords; and (3) the utilization of the multiple kernel learning framework to detect video events.

The rest of the paper is organized as follows. We first review related work, especially the popular feature fusion methods used in video event detection, in Section 2. We continue with the extraction of low-level video features in Section 3. In Section 4, we propose a high-level codeword framework for event detection based on temporal causality. Our experimental results on public datasets are provided in Section 5. The paper ends with conclusions and prospects for future work in Section 6.

2. Related Work

Multiple feature fusion for multimedia analysis has been extensively studied. Compared to using only a single feature, multifeature fusion has been proven to enhance the performance of multimedia content analysis. Generally speaking, early fusion and late fusion are the two popular ways to combine features [12]. Early fusion concatenates features from different modalities into a single vector, while late fusion combines the results of different classifiers into a final classification score according to a certain principle. However, how to construct suitable joint features and classifier combinations still remains an open issue.

In the field of machine learning, many researchers have been devoted to developing multiview learning algorithms to achieve multiple feature fusion. In [13], a multiple feature fusion algorithm was proposed by learning a generalized subspace in which the canonical correlation between low-level features is measured. Oh et al. designed a multimedia event detection framework based on a latent SVM model which can learn high-level concepts [14]. A multistage feature strategy was exploited by Natarajan et al. for complex event detection, including multiple kernel learning, score-level fusion, and weighted average fusion [15]. However, the majority of these methods need a large amount of labeled training data, whereas real-world videos often lack exact labels, especially consumer videos. Semisupervised learning methods have been proven to efficiently use unlabeled data to infer an accurate classifier [16–18]. In [16], Yang et al. designed a hierarchical regression model to learn classifiers which can utilize unlabeled data to represent multiple features. Recently, Ma et al. proposed a semisupervised learning framework for settings with little labeled training data by integrating multifeature learning and the Riemannian metric [17]. In [18], Xu et al. designed a cross-feature learning model for complex event detection based on multilevel relevance learning of related exemplars.

Some other works concentrated on the use of audio-visual cues for tracking and recognition [19, 20]. Derbas and Quénot proposed an audio-visual feature representation to detect violent scenes in movies [19]. Ionescu et al. designed a content descriptor which includes audio and color content for video categorization [20]. However, the empirical results of these methods are subject to many qualifications, such as the category of the video and the environment in which it was captured.

More recently, Jhuo et al. proposed an audio-visual bimodal representation for video event detection [21]. Audio-visual descriptors were first extracted to construct a bipartite graph that discovers the joint probability of audio words and visual words. Bimodal words were then obtained by graph partitioning. Different from the above methods based on statistical relations, Prabhakar et al. first produced a space-time dictionary based on temporal causality for visual event analysis [22]. As an extension of this work, Jiang and Loui introduced an audio-visual grouplet representation which uses the temporal audio-visual relation [23]. The authors constructed four types of grouplets from the combinations of foreground and background information. Despite the close relationship with our work, their method requires visual foreground/background separation and audio background/foreground extraction, which remain extremely difficult and time consuming for consumer videos. In contrast, the method proposed in this paper is suitable for general Internet videos and avoids region segmentation.

3. Low-Level Vocabulary Representation

The BoW approach is a popular feature representation method which has been proven to be surprisingly effective in video analysis [3]. In this paper, two channels of low-level features (visual and audio) are extracted from the training videos and then used to generate the corresponding BoW representations of the videos. We used the following low-level descriptors in our work.

SIFT. The scale-invariant feature transform (SIFT) has been widely used in video content analysis, such as object recognition and video concept detection [24], since it is invariant to image scale, rotation, and changing viewpoints. In this paper, the difference-of-Gaussians operator was adopted to find local keypoints in the frames. A 128-dimensional feature descriptor was then formed at each keypoint to capture the local gradients. In order to reduce the computation cost, we extracted features from sampled frames with a sample rate of 3 frames per second.
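As a minimal illustration of this step, the following Python sketch extracts SIFT descriptors from frames sampled at roughly 3 frames per second, assuming OpenCV 4.4 or later (where SIFT is available as cv2.SIFT_create); the video path handling and function name are illustrative, not part of the paper's implementation.

```python
import cv2

def extract_sift_descriptors(video_path, sample_fps=3):
    """Collect 128-dim SIFT descriptors from temporally sampled frames."""
    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(video_fps / sample_fps)), 1)   # keep ~3 frames per second
    sift = cv2.SIFT_create()                             # DoG keypoints + SIFT descriptors
    descriptors, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            _, desc = sift.detectAndCompute(gray, None)
            if desc is not None:
                descriptors.append(desc)                 # (num_keypoints, 128)
        frame_idx += 1
    cap.release()
    return descriptors
```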

STIP. As an important cue for video content analysis, the popular spatial-temporal interest point (STIP) descriptor captures the local space-time structure where the image values have significant local variations in both space and time [25]. In this paper, the Harris 3D detector was adopted to locate space-time volumes. Each volume was subdivided into a grid of cuboids, and 4-bin histograms of gradients (HOG) and 5-bin histograms of optical flow (HOF) were computed from the grid cells. The parameters were set the same as in [25]. Finally, we directly concatenated the HOG and HOF features into a 162-dimensional vector which represents the local motion.

DTF. The dense trajectory feature (DTF) has been shown to be among the best visual features for video analysis [26]. Following the settings in [26], we extracted dense trajectories by sampling feature points on a dense grid, and the trajectory descriptors were obtained from the space-time volume around each trajectory. Finally, we extracted a 96-dimensional HOG feature for each trajectory.

MFCC. Acoustic features have been found to be very useful for various recognition systems. Among the different acoustic features, mel-frequency cepstral coefficients (MFCC) [27], which collectively represent the short-term power spectrum of sound based on a linear cosine transform, are one of the most prevalent choices for audio recognition. For each video, we extracted 36-dimensional MFCC features over a 20 ms window with a 10 ms overlap.
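A minimal sketch of this extraction step, assuming the audio track is available as a mono waveform readable by librosa (the library choice and function name are illustrative, not the paper's exact toolchain):

```python
import librosa

def extract_mfcc(audio_path, n_mfcc=36):
    """36-dim MFCCs over 20 ms windows with a 10 ms hop."""
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    n_fft = int(0.02 * sr)       # 20 ms analysis window
    hop_length = int(0.01 * sr)  # 10 ms step between windows
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
    return mfcc.T                # shape: (num_frames, 36)
```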

Four low-level codebooks were generated by clustering the above features, respectively. For each video clip, the four features were quantized to form four BoW histogram representations. In order to analyze the temporal causality between low-level features, we directly concatenated the different visual features to form a visual BoW which provides the visual information of the video, while the MFCC BoW represents the audio information. They were used to extract high-level codewords as discussed in the next section.
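The clustering and quantization step can be sketched as follows, assuming k-means codebooks built with scikit-learn; the vocabulary size and function names are illustrative only:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(descriptor_list, vocab_size=1000, seed=0):
    """Cluster descriptors pooled over all training videos into a codebook."""
    data = np.vstack(descriptor_list)
    return MiniBatchKMeans(n_clusters=vocab_size, random_state=seed).fit(data)

def bow_histogram(codebook, descriptors):
    """Quantize one video's descriptors into a normalized BoW histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```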

4. High-Level Codewords Representation

In this section the high-level codeword representation based on Granger Causality is explained in detail. We first view each word in the video as a point process and analyze the Granger Causality between low-level codewords. Then we construct a weighted undirected graph based on the temporal relationships to extract high-level codewords. Finally, we use the multiple kernel learning framework to detect video events.

4.1. Temporal Causality between Low-Level Codewords

Audio information is an important cue for video event detection. The appearance of certain visual objects is often accompanied by particular background sounds. For example, the audio background of a basketball match often contains ball-bouncing sounds, and the appearance of a dog in a video is often accompanied by barking. Therefore, we first analyze the temporal causality between visual information and audio information and then detect video events based on the audio-visual relevance.

Prabhakar et al. were the first to propose a method to describe the temporal causality between visual words in videos by viewing the word sequences as a multivariate point process [22]. Here we use $A = \{a_1, a_2, \ldots, a_{N_a}\}$ and $V = \{v_1, v_2, \ldots, v_{N_v}\}$ to represent the sets of audio and visual vocabulary, respectively, where $N_a$ and $N_v$ denote the number of audio and visual words. In order to investigate the cooccurrence of $A$ and $V$, we record the occurrences of each word $a_j$ and $v_i$ within each video frame. Firstly, the number of occurrences of word $v_i$ in the interval $(t, t+dt]$ is defined as
$$dN_{v_i}(t) = N_{v_i}(t+dt) - N_{v_i}(t), \tag{1}$$
where $dt$ denotes the time resolution. The mean intensity of the process is defined as $\lambda_{v_i} = E[dN_{v_i}(t)]/dt$. Then we consider the zero-mean process $d\tilde{N}_{v_i}(t) = dN_{v_i}(t) - \lambda_{v_i}\,dt$ and rename that process $\tilde{N}_{v_i}$. Therefore, all visual words create an $N_v$-dimensional multivariate point process. Similarly, an $N_a$-dimensional multivariate point process can be created for the audio words.
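The construction of the zero-mean multivariate point process can be sketched as below, where `word_counts` is a hypothetical (num_frames, num_words) array holding the per-frame occurrence counts of the low-level words (one column per word, one row per time bin of width dt):

```python
import numpy as np

def to_zero_mean_point_process(word_counts, dt=1.0):
    """Return the zero-mean increments dN~(t) = dN(t) - lambda * dt for every word."""
    word_counts = np.asarray(word_counts, dtype=float)
    mean_intensity = word_counts.mean(axis=0) / dt      # lambda_i for each word
    return word_counts - mean_intensity * dt            # (num_frames, num_words)
```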

We use the method in [10] to estimate the Granger Causality between any visual point process $\tilde{N}_{v_i}$ and any audio point process $\tilde{N}_{a_j}$. Firstly, the spectral matrix of the above two point processes is defined as follows:
$$S(f) = \begin{pmatrix} S_{v_i v_i}(f) & S_{v_i a_j}(f) \\ S_{a_j v_i}(f) & S_{a_j a_j}(f) \end{pmatrix}, \tag{2}$$
where the off-diagonal elements $S_{v_i a_j}(f)$ represent the cross-spectrum between the visual point process $\tilde{N}_{v_i}$ and the audio point process $\tilde{N}_{a_j}$. We used the multitaper method [28] to estimate the spectral matrix. In that method, $K$ data tapers $h_k(t)$, $k = 1, \ldots, K$, are applied sequentially to the point processes $\tilde{N}_{v_i}$ and $\tilde{N}_{a_j}$, and the tapered Fourier transform of $d\tilde{N}_{v_i}$ is taken as follows:
$$\tilde{N}_{v_i}(f, k) = \int_{0}^{T} h_k(t)\, e^{-2\pi i f t}\, d\tilde{N}_{v_i}(t). \tag{3}$$
The Fourier transform of $d\tilde{N}_{a_j}$, denoted $\tilde{N}_{a_j}(f, k)$, can be computed in the same way as (3). Then, the spectral matrix elements are estimated as follows:
$$S_{v_i a_j}(f) = \frac{1}{2\pi K T} \sum_{k=1}^{K} \tilde{N}_{v_i}(f, k)\, \tilde{N}_{a_j}^{*}(f, k). \tag{4}$$
For the time series of the multivariate point processes $\tilde{N}_{v_i}$ and $\tilde{N}_{a_j}$, we adopt an autoregressive model to fit the data. The spectral matrix $S(f)$ is then factorized as follows:
$$S(f) = H(f)\, \Sigma\, H^{*}(f), \tag{5}$$
where $H(f)$ is the transfer function determined by the coefficient matrices of the autoregressive model and $\Sigma$ is the joint covariance of the error terms in the autoregressive model. Finally, the Granger Causality from $a_j$ to $v_i$ is estimated by the method developed in [29] as
$$G_{a_j \to v_i} = \frac{1}{f_N} \int_{0}^{f_N} \ln \frac{S_{v_i v_i}(f)}{S_{v_i v_i}(f) - \left(\Sigma_{aa} - \Sigma_{va}^{2}/\Sigma_{vv}\right) \left|H_{v_i a_j}(f)\right|^{2}}\, df, \tag{6}$$
where the integral runs over all frequencies up to the Nyquist frequency $f_N$.

Notice that the Granger Causality from $a_j$ to $v_i$ is not always equal to the Granger Causality from $v_i$ to $a_j$ due to the directionality. Similarly, the Granger Causality from $v_i$ to $a_j$ is defined as follows:
$$G_{v_i \to a_j} = \frac{1}{f_N} \int_{0}^{f_N} \ln \frac{S_{a_j a_j}(f)}{S_{a_j a_j}(f) - \left(\Sigma_{vv} - \Sigma_{av}^{2}/\Sigma_{aa}\right) \left|H_{a_j v_i}(f)\right|^{2}}\, df. \tag{7}$$

Then the value of the Granger Causality between audio word $a_j$ and visual word $v_i$ is defined as the maximum over the two directions:
$$G(a_j, v_i) = \max\left(G_{a_j \to v_i},\; G_{v_i \to a_j}\right). \tag{8}$$
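For illustration, the sketch below uses a simplified time-domain Granger Causality (the log ratio of residual variances of restricted and full autoregressive models fitted by least squares) instead of the multitaper spectral estimator of (2)–(6); the lag order p is an assumed hyperparameter.

```python
import numpy as np

def granger_causality(x, y, p=5):
    """Time-domain GC from y to x: ln(restricted residual variance / full residual variance)."""
    T = len(x)
    X_lags = np.column_stack([x[p - k - 1:T - k - 1] for k in range(p)])  # past of x
    Y_lags = np.column_stack([y[p - k - 1:T - k - 1] for k in range(p)])  # past of y
    target = x[p:]
    # Restricted model: predict x from its own past only.
    res_r = target - X_lags @ np.linalg.lstsq(X_lags, target, rcond=None)[0]
    # Full model: predict x from the past of both x and y.
    XY = np.hstack([X_lags, Y_lags])
    res_f = target - XY @ np.linalg.lstsq(XY, target, rcond=None)[0]
    return float(np.log(res_r.var() / max(res_f.var(), 1e-12)))

def symmetric_gc(x, y, p=5):
    """Undirected causality score: the maximum of the two directions, as in (8)."""
    return max(granger_causality(x, y, p), granger_causality(y, x, p))
```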

4.2. Construction of the Audio-Visual Graph with Temporal Attributes

For all of the training videos, we extracted the visual and audio features described in Section 3 and then formed the visual words and audio words by the k-means clustering method. In this section, we define a weighted undirected graph $G = (\mathcal{V}, E)$ to describe the causality between the words. Here $\mathcal{V}$ is the set of vertices, represented as follows:
$$\mathcal{V} = \{v_1, v_2, \ldots, v_{N_v}, a_1, a_2, \ldots, a_{N_a}\}. \tag{9}$$
Each node in $\mathcal{V}$ corresponds to a visual word or an audio word.

The edge set $E$ is defined to measure the concurrence relationships between the words in $\mathcal{V}$. These relationships fall into three types: the relationship between visual words, the relationship between audio words, and the relationship between audio words and visual words. The Granger Causality between audio words and visual words is defined in (8). Similarly, the Granger Causality of the other two types can be written as follows:
$$G(v_i, v_j) = \max\left(G_{v_i \to v_j},\; G_{v_j \to v_i}\right), \qquad G(a_i, a_j) = \max\left(G_{a_i \to a_j},\; G_{a_j \to a_i}\right). \tag{10}$$

In order to reduce the computation cost, we use statistical thresholds to discover the causal relationships in the Granger Causality matrices. Here we adopt three different thresholds $\theta_{VV}$, $\theta_{AA}$, and $\theta_{AV}$ for the visual-visual, audio-audio, and audio-visual causality matrices $G_{VV}$, $G_{AA}$, and $G_{AV}$, respectively. Granger Causality scores that are less than the given threshold are regarded as nontemporal relationships and set to zero. After that, the score values that are larger than the threshold are normalized.

Based on the above analysis, the weight matrix of the undirected graph is defined as follows:
$$W = \begin{pmatrix} G_{VV} & G_{AV}^{T} \\ G_{AV} & G_{AA} \end{pmatrix}, \tag{11}$$
where $G_{AV}^{T}$ denotes the transpose of the matrix $G_{AV}$.
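A minimal sketch of assembling the thresholded weight matrix, assuming the pairwise causality matrices G_vv (visual-visual), G_aa (audio-audio), and G_av (audio-visual, shape N_a x N_v) have been computed; the threshold values are illustrative hyperparameters:

```python
import numpy as np

def build_weight_matrix(G_vv, G_aa, G_av, th_vv=0.1, th_aa=0.1, th_av=0.1):
    """Assemble the block weight matrix W of the audio-visual graph."""
    def threshold_normalize(G, th):
        G = np.where(G < th, 0.0, G)      # scores below the threshold: non-temporal
        m = G.max()
        return G / m if m > 0 else G      # normalize the surviving scores
    G_vv = threshold_normalize(G_vv, th_vv)
    G_aa = threshold_normalize(G_aa, th_aa)
    G_av = threshold_normalize(G_av, th_av)
    # Block layout: visual nodes first, then audio nodes, as in (11).
    return np.block([[G_vv, G_av.T],
                     [G_av, G_aa]])
```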

4.3. High-Level Codewords Representation for Event Detection

For the audio-visual graph constructed in Section 4.2, a greedy agglomerative graph-partitioning method [30] is adopted to extract low-level word groups. Given a partition of the vertex set $\mathcal{V}$ into groups $\{c_1, c_2, \ldots, c_C\}$, the maximum intragroup similarity (normalized association) is defined as follows:
$$\mathrm{Asso}(\mathcal{V}) = \max \sum_{l=1}^{C} \frac{W(c_l, c_l)}{\deg(c_l)}, \tag{12}$$
where $W(c_l, c_l)$ denotes the sum of the weights of all edges within subset $c_l$ and $\deg(c_l)$ denotes the sum of the degrees of all vertices in subset $c_l$.

We perform hierarchical clustering based on an improved association matrix which is defined on the edges of the weighted graph. The element of the improved matrix is defined as follows:
$$\widetilde{W}(c_i, c_j) = \frac{W(c_i, c_j)}{\deg(c_i)\,\deg(c_j)}, \tag{13}$$
where $c_i$ and $c_j$ denote different clusters in the graph. Initially, each cluster $c_i$ or $c_j$ is a single vertex of the graph. In each stage of the clustering, we select the cluster pair $(c_i, c_j)$ which has the maximum element in the matrix $\widetilde{W}$ to form a larger cluster $c_{ij}$. Then, the matrix $\widetilde{W}$ is updated by removing the rows and columns related to $c_i$ and $c_j$; at the same time, a new row and column corresponding to the cluster $c_{ij}$ are inserted into the matrix $\widetilde{W}$. In order to continue with the next iteration, the weight matrix and the improved matrix are updated as follows:
$$W(c_{ij}, c_k) = W(c_i, c_k) + W(c_j, c_k), \qquad \deg(c_{ij}) = \deg(c_i) + \deg(c_j), \qquad \widetilde{W}(c_{ij}, c_k) = \frac{W(c_{ij}, c_k)}{\deg(c_{ij})\,\deg(c_k)}. \tag{14}$$

Determining the number of clusters is an important problem in graph partitioning. In this paper, we adopt an effective order-selection method applied after the initial hierarchical clustering [30]. At each step of the clustering, we define a metric $Q(C)$ to describe the quality of the partition into $C$ clusters. The value of $Q(C)$ is defined as follows:
$$Q(C) = \frac{\mathrm{Asso}_C(\mathcal{V})}{C}, \tag{15}$$
where $\mathrm{Asso}_C(\mathcal{V})$ denotes the maximum normalized association over the partitions of the vertex set $\mathcal{V}$ into $C$ clusters. Then the number of clusters is selected as follows:
$$C^{*} = \arg\max_{C} Q(C). \tag{16}$$
In practice, the value of $C^{*}$ can be obtained by (16) or be provided by the user.
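The greedy merging procedure can be sketched as follows; this is a simplified illustration that merges the cluster pair with the largest degree-normalized inter-cluster weight until a target number of clusters is reached, and it does not reproduce the exact association measure or order-selection rule of [30] (num_clusters is assumed to come from (16) or from the user):

```python
import numpy as np

def greedy_partition(W, num_clusters):
    """Greedy agglomerative partition of the weighted graph given by matrix W."""
    n = W.shape[0]
    clusters = [[i] for i in range(n)]          # start with singleton clusters
    degree = W.sum(axis=1)
    while len(clusters) > num_clusters:
        best, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = clusters[i], clusters[j]
                w_ij = W[np.ix_(ci, cj)].sum()  # total weight between the two clusters
                d_i, d_j = degree[ci].sum(), degree[cj].sum()
                score = w_ij / (d_i * d_j) if d_i > 0 and d_j > 0 else 0.0
                if score > best:
                    best, best_pair = score, (i, j)
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]  # merge the best pair
        del clusters[j]
    return clusters                              # list of node-index groups
```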

Each cluster in the graph partition forms a low-level word group which captures the temporal causality patterns between audio and visual features in the videos. All the low-level groups together form a high-level audio-visual dictionary, represented as $D = \{d_1, d_2, \ldots, d_C\}$, where $C$ is the number of clusters. Each audio-visual codeword $d_k$ is represented as the combination of a subset of audio words and a subset of visual words.

For a given video $x$, the extracted visual features and audio features are mapped into the new audio-visual groups to generate a high-level dictionary-based feature representation. Here we adopt an average pooling principle to aggregate the original features. The bag of high-level words is defined as follows:
$$h_k(x) = \frac{1}{m_k + n_k}\left(\sum_{i=1}^{m_k} H_a\!\left(x, a_i^{k}\right) + \sum_{j=1}^{n_k} H_v\!\left(x, v_j^{k}\right)\right), \tag{17}$$
where $m_k$ and $n_k$ represent the number of audio words and visual words in the high-level codeword $d_k$, $a_i^{k}$ denotes the $i$th audio word, $v_j^{k}$ denotes the $j$th visual word, $H_a(x, a_i^{k})$ denotes the value of the bin corresponding to $a_i^{k}$ in the audio-word histogram representation of video $x$, and $H_v(x, v_j^{k})$ denotes the value of the bin corresponding to $v_j^{k}$ in the visual-word histogram representation of video $x$. According to (17), the bag of high-level words representation of a video $x$ is given as follows:
$$H(x) = \left[h_1(x), h_2(x), \ldots, h_C(x)\right], \tag{18}$$
and this representation is computed for all training videos.
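A minimal sketch of the average-pooling mapping in (17), assuming `groups` holds the graph partition as lists of node indices with visual nodes indexed before audio nodes (variable names are illustrative):

```python
import numpy as np

def high_level_bow(visual_hist, audio_hist, groups):
    """Map low-level BoW histograms to the high-level codeword histogram."""
    low_level = np.concatenate([visual_hist, audio_hist])   # node order: visual, then audio
    return np.array([low_level[g].mean() for g in groups])  # one bin per high-level codeword
```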

4.4. Video Event Detection Based on Multiple Kernel Learning

Multiple kernel learning frameworks have been intensively applied in video analysis [11, 16–18]. In this paper, we combine our high-level codewords with the commonly used simpleMKL algorithm [11]. Since our high-level codewords encode the temporal causality between visual and audio words, it is difficult to decide the optimal codeword vocabulary size in advance, so we adopt codeword representations of several sizes within the simpleMKL framework. The simpleMKL framework solves the following optimization problem:
$$\min_{\{f_m\}, b, \xi, d} \; \frac{1}{2}\sum_{m} \frac{1}{d_m}\left\|f_m\right\|_{\mathcal{H}_m}^{2} + C \sum_{i} \xi_i \quad \text{s.t.} \; y_i\!\left(\sum_{m} f_m(x_i) + b\right) \geq 1 - \xi_i, \; \xi_i \geq 0, \; \sum_{m} d_m = 1, \; d_m \geq 0, \tag{19}$$
where each $f_m$ lies in the reproducing kernel Hilbert space $\mathcal{H}_m$ associated with one basic kernel and $d_m$ are the learned kernel weights.
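SimpleMKL learns the kernel weights jointly with the SVM; as a simplified stand-in, the sketch below combines precomputed chi-square kernels built from high-level BoW features of several codebook sizes using fixed uniform weights and trains a precomputed-kernel SVM with scikit-learn. The kernel choice, weights, and parameter C are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_combined_kernel_svm(feature_sets, labels, weights=None, C=10.0):
    """feature_sets: list of (num_videos, dim_m) high-level BoW matrices, one per codebook size."""
    if weights is None:
        weights = [1.0 / len(feature_sets)] * len(feature_sets)  # uniform instead of learned
    K = sum(w * chi2_kernel(X, X) for w, X in zip(weights, feature_sets))
    return SVC(kernel="precomputed", C=C).fit(K, labels)
```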

Due to the diversity of consumer videos in practical applications, only a small amount of properly labeled training data is available. Recently, Xu et al. proposed an event detection method to exploit related exemplars when labeled training data is scarce, discriminating positive and negative exemplars by learning multiple relevance-level labels [18]. We follow the multirelevance-level learning formulation and its reformulation given in [18] and use the resulting kernel matrices as the basic kernels in the MKL problem.

5. Experiment and Discussion

5.1. Experimental Setup

In this work, we evaluated our proposed high-level codeword representation for event detection on the large-scale Columbia Consumer Video (CCV) Dataset [3], which contains 9,317 consumer videos from YouTube (210 hours in total). These consumer videos contain diverse content without postediting, and the original audio tracks are preserved. All videos are manually labeled with 20 semantic categories. Following the setting in [3], we use the same 4,659 videos for training and the remaining 4,658 videos for testing.

All our experiments were performed on a server with Intel Xeon 2.4 GHz CPUs and 32 GB RAM using a single thread. For performance evaluation, we use average precision (AP, the area under the precision-recall curve) and mean average precision (MAP, the mean of AP across all event categories) as our evaluation metrics [3].
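The evaluation metrics can be computed as in the sketch below, using scikit-learn's average precision; the array names are illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """y_true, y_score: (num_videos, num_events) binary labels and detection scores."""
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]
    return float(np.mean(aps)), aps   # MAP and per-event APs
```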

In order to demonstrate the effectiveness of our method, we systematically evaluate the following methods:

(1) Individual feature: we performed experiments on the four features (SIFT, STIP, MFCC, and DTF); however, we only report the results of STIP and DTF.

(2) Early fusion: in order to evaluate the influence of audio information, we compared the performance of different combinations of audio and visual features, such as SIFT + MFCC, DTF + MFCC, and SIFT + DTF + MFCC.

(3) MKL based on joint audio-visual codewords (MKL_AVC): we use the joint audio-visual codewords in [21]; specifically, we adopted only the audio-visual graph construction method of [21], while the graph partitioning is performed as described in this paper.

(4) MKL based on high-level codewords (MKL_HLC): we used the simpleMKL framework [11] to combine our high-level codewords based on temporal causality.

(5) Multilevel relevance labels and MKL based on high-level codewords (MLMKL_HLC): we used the multirelevance-level learning method in [18] to learn training labels and then combined our high-level codewords for event detection. In this experiment, each semantic category was labeled with k relevance levels, where label k is assigned to positive samples and label 1 to negative samples; we fixed k to 4.

5.2. Performance of Low-Level Features

To evaluate the performance of the low-level features, we trained a classifier for each semantic category using a one-versus-all kernel SVM, which has proven outstanding performance for classifying BoW-based features. The AP and MAP results are shown in Figure 4.

As for the individual feature experiment, we can see that the four individual features have different advantages across different categories. In Figure 3, we present only the MAP of DTF and STIP, which achieved the best performance among the four individual features. It can be observed that our results fall slightly behind the results in [3]. This is because a plain bag-of-words histogram is used here, whereas a spatial-layout representation is additionally used in [3].

As for the early fusion of individual features, we can see that combining audio and visual feature representations through early fusion improves the detection results. For example, the AP of all categories is clearly improved by the combination of the three single features (SIFT + DTF + MFCC), and the MAP is improved by nearly 10% on a relative basis.

5.3. Performance of High-Level Features

To evaluate the performance of the high-level features, we compared our high-level codewords with the audio-visual codewords in [21]. Furthermore, we evaluated the performance of our high-level codewords under the simpleMKL framework [11] and the multirelevance-level learning MKL framework [18], respectively. According to the results in Section 5.2, we incorporated only SIFT, DTF, and MFCC into our high-level codewords.

As for the performance of the methods based on high-level features, we can see that the three methods (MKL_AVC, MKL_HLC, and MLMKL_HLC) outperform the methods based on individual features and feature combination. Such results were within our expectations because of the importance of the relationships between low-level codewords. In particular, our proposed method (MKL_HLC) outperforms the baseline method MKL_AVC by nearly 3% in terms of MAP, which demonstrates the effectiveness of our proposed method. For instance, on the event “dog,” our method (MKL_HLC) outperforms the individual feature STIP by 15% and outperforms the baseline method MKL_AVC by 9%. Besides, compared with the best baseline method MKL_AVC, our high-level codeword method achieves the highest relative performance gain on the categories “birds” and “dogs.” This may be because the appearance of the visual object (bird or dog) is often accompanied by barking or birdsong. However, our method performs only moderately on the category “wedding reception,” which may be due to the large amount of background noise accompanying people’s actions. We also combined our high-level codewords with the multilevel relevance learning framework in [18]. We can see that MLMKL_HLC outperforms our MKL_HLC method by nearly 1% in terms of MAP, which indicates the effectiveness of the multirelevance-level learning in [18].

In general, we can expect relatively higher performance of the proposed method on other types of events which have obvious audio-visual associations.

5.4. Codebook Analysis and Visualization

The codebook size can obviously impact the performance of event detection [21]. We expect each vocabulary entry to reflect a high correlation between its low-level words. Therefore, in the stage of high-level codeword representation (Section 4.3), different numbers of clusters for order selection are manually selected. We compare the performance of different codebook sizes and different methods (MKL_AVC, MKL_HLC, and MLMKL_HLC). The MAP performance is shown in Figure 5. We can see that the performance of the three methods gradually increases with increasing codebook size. For this dataset, 6,000 words seems to be a good choice for MLMKL_HLC. The event detection results in Section 5.3 are reported using the best codebook size for our proposed methods (MKL_HLC, MLMKL_HLC) and the baseline method (MKL_AVC).

We also compare the distribution of audio words and visual words in each high-level vocabulary for the two methods. For the methods MKL_HLC and MKL_AVC, the proportion of audio-visual vocabulary entries, which contain both audio words and visual words, is found to be 45% and 34%, respectively. This shows that our high-level codewords capture more associations between audio words and visual words, compared to the bimodal words based on the probability relationship in [21]. As indicated in the introduction of this paper, our high-level codewords are effective for video events that contain audio-visual correlations. Figure 6 gives an example of this type of correlation. In the event “Birthday,” the appearance of the cake and candles is often accompanied by the birthday song, and at the end of the song there are sounds of clapping and cheering. Figure 7 shows the high-level codewords of that video. Visual words in those high-level codewords are shown as sampled local points in the frame which are extremely close to the codebook vocabulary. The audio words in the high-level codewords are shown as the spectrogram of the sound over 500 ms windows, where the MFCC features in that window are similar to the codebook vocabulary.

It is also observed that there are a large number of vocabulary entries which contain only visual words or only audio words. The existence of these single-channel vocabularies is reasonable because not every visual word is correlated to an audio word. Specifically, the audio words or visual words which compose the single-channel vocabularies in our method are also grouped together by the Granger Causality between them. We believe that effective single-channel vocabularies which share similar temporal patterns are also important cues for event detection. Figure 8 illustrates the effect of single-channel vocabularies which include only visual words. For the video sequence of the category “Wedding Dance,” the visual words are shown in the first row, and circles of different colors are used to represent different visual words. The temporal high-level codewords produced by our method are shown in the second row. From Figure 8, we can see that the large majority of visual words produced by the hugging action of the two dancers are grouped into the same temporal group.

6. Conclusion

In this paper, we have introduced a high-level codeword representation framework for video event detection which can effectively utilize the low-level features in videos. By viewing the set of low-level words as the instantiation of a multivariate point process, we developed a Granger Causality graph to model the relationships between the low-level words of the videos. The graph is then partitioned into low-level word groups which have similar temporal patterns. Extensive experiments consistently show that the proposed high-level codeword representation outperforms state-of-the-art multimodal fusion methods. With these findings we conclude that high-level codeword representations will play an important role in future video event detection systems. At the same time, more advanced representation models are worth intensive study in the future to meet practical application needs.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the Research Foundation of Education Bureau of Hunan Province, China (Grant no. 13C474).