Representation for Action Recognition Using Trajectory-Based Low-Level Local Feature and Mid-Level Motion Feature
Dense trajectories and low-level local features have been widely used in action recognition in recent years. However, most of these methods ignore the motion parts of an action, which are the key factor in distinguishing different human actions. This paper proposes a new two-layer representation for action recognition that describes the video with low-level features and a mid-level motion part model. First, we encode the compensated flow (ω-flow) trajectory-based local features with Fisher Vector (FV) to retain the low-level characteristics of motion. Then, the motion parts are extracted by clustering similar trajectories using a spatiotemporal distance between trajectories. Finally, the representation of an action video is the concatenation of the low-level descriptor encoding vector and the motion part encoding vector. It is used as input to LibSVM for action recognition. The experimental results demonstrate improvements on the J-HMDB and YouTube datasets, with accuracies of 67.4% and 87.6%, respectively.
1. Introduction
Human action recognition has become a hot topic in the field of computer vision. Practical systems built on it can be applied to video surveillance, interactive gaming, and video annotation. Despite remarkable research efforts and many encouraging advances in recent years [1–3], action recognition is still far from satisfactory and practical. Many factors affect recognition accuracy, such as cluttered background, illumination, and occlusion.
Most action recognition research focuses on two important issues: extracting features within a spatiotemporal volume and modeling the action patterns. Many existing studies on human action recognition extract features from whole 3D videos using spatiotemporal interest points (STIP). In recent years, optical flow has been applied to extract trajectory-based motion features, which are now widely used as local spatiotemporal features. Local trajectory-based features are pooled and normalized into a vector that serves as the global video representation in action recognition. Meanwhile, a lot of work has focused on developing discriminative dictionaries for image object recognition or video action recognition. The Bag of Features (BOF) model generates a simple video model by clustering the spatiotemporal features of all the training samples and is trained using a Support Vector Machine (SVM). The state-of-the-art method is the popular Fisher Vector (FV) encoding model based on spatiotemporal local features.
However, these methods are limited, because they consider only low-level spatiotemporal features based on interest points and ignore the higher-level features of the motion parts. For most actions, only a small subset of the local motion features of the entire video is relevant to the action label. When a person is waving, only the movement around the arm or hand is responsible for the action label. Action Bank and motionlets adopt unsupervised learning to discover action parts. Many methods cluster the trajectories and seek to understand the spatiotemporal properties of movement to construct a mid-level action video representation. The Vector of Locally Aggregated Descriptors (VLAD) is a descriptor encoding technique that aggregates the descriptors based on a locality criterion in the feature space. Because it keeps more spatiotemporal characteristics of the processed motion parts, VLAD encoding has been reported to give better results than BOF.
Motivated by the observation that low-level local feature encoding and the mid-level motion part model are both key factors in distinguishing different human actions, we propose in this paper a new representation (depicted in Figure 2) for action recognition based on local features and motion parts. To reduce background clutter noise, we extract the local trajectory-based features with the compensated flow (ω-flow) dense trajectories method. Then we cluster the trajectories with a graph clustering algorithm and encode the group features to describe the different motion parts. Finally, we represent the video by combining the low-level trajectory-based feature encoding model with the mid-level motion part model.
This paper is organized as follows. In Section 2, the local descriptors based on the ω-flow dense trajectories and the low-level video encoding with FV are introduced. Section 3 describes how the motion parts are clustered and introduces the video representation. We describe the evaluation of our approach and discuss the results in Section 4. Finally, the conclusion and future work are discussed in Section 5.
2. First Layer with FV
Trajectories are efficient in capturing object motions in videos. We extract spatiotemporal features along the ω-flow dense trajectories to express low-level descriptors. In this section we introduce the ω-flow dense trajectories and the low-level descriptors with FV.
2.1. ω-Flow Dense Trajectory
The idea of dense trajectories is based on tracking interest points. The interest points are sampled on a grid spaced by $W$ pixels and tracked in each frame. Points of subsequent frames are concatenated to form a trajectory $(P_t, P_{t+1}, P_{t+2}, \ldots)$, where $P_t = (x_t, y_t)$ is the position of the interest point at frame $t$. The length of a trajectory is $L$ frames. A recent work by Jain et al. proposed the compensated flow (ω-flow) dense trajectories, which reduce the impact of background trajectories. The interest points in this method are tracked with ω-flow to compensate for the dominant motion (camera motion), which is beneficial for most of the existing descriptors used for action recognition. The method uses a 2D polynomial affine motion model to compensate for camera motion. The affine flow is the dominant motion between two consecutive images and is usually caused by the movement of the camera. We compute the affine flow with the publicly available Motion2D software (http://www.irisa.fr/vista/Motion2D/), which implements a real-time robust multiresolution incremental estimation framework. The final flow vector at point $p$ is obtained by removing the affine flow vector from the original optical flow vector as follows:
$$\omega_t(p) = f_t(p) - f_{\mathrm{aff}}(p),$$
where $f_t(p)$ is the original optical flow vector and $f_{\mathrm{aff}}(p)$ is the affine flow vector at $p$.
Figure 1 shows the dense trajectories extracted by the iDT method and the ω-flow dense trajectories.
Figure 1: (a) iDT vectors; (b) ω-flow DT vectors.
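The compensation step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Motion2D implementation: the affine parameters are taken as already estimated, the six-parameter layout and the function names `affine_flow` and `compensated_flow` are hypothetical.

```python
def affine_flow(params, x, y):
    """Affine (dominant) flow at pixel (x, y).

    Assumed parameter layout (hypothetical): params = (c1, c2, a1, a2, a3, a4)
    with u = c1 + a1*x + a2*y and v = c2 + a3*x + a4*y.
    """
    c1, c2, a1, a2, a3, a4 = params
    return (c1 + a1 * x + a2 * y, c2 + a3 * x + a4 * y)


def compensated_flow(flow, params, x, y):
    """Subtract the affine flow from the optical flow vector at (x, y)."""
    u, v = flow
    ua, va = affine_flow(params, x, y)
    return (u - ua, v - va)


# A pure camera pan of 2 px/frame to the right: a background pixel ends up
# with zero compensated flow, a moving foreground pixel keeps its own motion.
pan = (2.0, 0.0, 0.0, 0.0, 0.0, 0.0)
background = compensated_flow((2.0, 0.0), pan, 10, 20)
foreground = compensated_flow((5.0, 1.0), pan, 10, 20)
```

In this toy case the background vector vanishes after compensation, which is why trajectories tracked on the compensated flow are less affected by camera motion.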
The shape of a trajectory encodes local motion patterns and is described by concatenating a set of displacement vectors $\Delta P_t = P_{t+1} - P_t$, where $P_t$ is the position of the tracked point at frame $t$. Meanwhile, to leverage the motion information in dense trajectories, we compute descriptors within a spatiotemporal volume around the trajectory. The volume spans $N \times N$ pixels and $L$ frames and is divided into a spatiotemporal grid. The Histograms of Optical Flow descriptors (HOF and ω-HOF) capture local motion information and are computed from the orientations and magnitudes of the flow field. The motion boundary histogram (MBH) descriptor encodes the relative motion between pixels along both the $x$ and $y$ image axes and provides discriminative features for action recognition under background clutter. The trajectory-based ω-HOF features are computed on the compensated flow. For each trajectory, the descriptors combine the motion information of HOF, ω-HOF, and MBH. The single trajectory feature has the form
$$T = (S, \mathrm{HOF}, \omega\text{-HOF}, \mathrm{MBH}), \qquad S = \frac{(\Delta P_t, \ldots, \Delta P_{t+L-1})}{\sum_{j=t}^{t+L-1} \lVert \Delta P_j \rVert},$$
where the trajectory shape $S$ is normalized by the sum of the magnitudes of the displacement vectors and $L$ is the length of the trajectory.
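The shape normalization described above can be illustrated as follows. This is a small sketch that assumes a trajectory is given as a list of (x, y) positions; the function name `trajectory_shape` is hypothetical.

```python
import math


def trajectory_shape(points):
    """Normalized displacement vectors of a trajectory.

    points: list of (x, y) positions of one tracked point in consecutive frames.
    Returns the displacement vectors divided by the sum of their magnitudes.
    """
    disps = [(x1 - x0, y1 - y0)
             for (x0, y0), (x1, y1) in zip(points, points[1:])]
    total = sum(math.hypot(dx, dy) for dx, dy in disps)
    if total == 0.0:
        # Static trajectory: there is no motion to normalize.
        return [(0.0, 0.0)] * len(disps)
    return [(dx / total, dy / total) for dx, dy in disps]


# Two displacements of magnitude 5 each, so every component is divided by 10.
shape = trajectory_shape([(0, 0), (3, 4), (3, 9)])
```

Dividing by the total path length makes the shape descriptor independent of the overall speed of the motion, so only the pattern of the movement is kept.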
2.2. Low-Level Video Encoding
The representation of video is a vital problem in action recognition. We first encode the low-level ω-flow trajectory-based descriptors using the Fisher Vector (FV) encoding method, which was proposed for image categorization. FV is derived from the Fisher kernel and encodes the statistics between video descriptors and a Gaussian Mixture Model (GMM). We reduce the dimensionality of the low-level features (ω-HOF, HOF, and MBH) by PCA, keeping 90% of the energy. The local descriptors $X = \{x_1, \ldots, x_M\}$ can be modeled by a probability density function $p(X; \theta)$ with parameters $\theta$, which is usually a GMM:
$$p(x; \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x; \mu_k, \Sigma_k), \qquad \theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K},$$
where $\theta$ are the model parameters denoting the weights, means, and diagonal covariances of the GMM, $M$ is the number of local descriptors, and $K$ is the number of mixture components, which we set to 256. We can compute the gradient of the log likelihood with respect to the parameters of the model to represent a video:
$$G_\theta^X = \frac{1}{M} \nabla_\theta \log p(X; \theta).$$
FV requires a GMM of the encoded feature distribution. The Fisher Vector is the concatenation of these partial derivatives and describes in which direction the parameters of the model should be modified to best fit the data. To keep the low-level features, we encode each video with the FV encoding.
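As a rough illustration of the encoding, the sketch below computes a Fisher Vector for a set of descriptors under a diagonal-covariance GMM whose parameters are assumed to be already trained. It uses the commonly used normalized gradients with respect to the means and standard deviations only, which is an assumption rather than a detail stated here; PCA and GMM training are omitted, and the function names are hypothetical.

```python
import math


def gaussian_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian evaluated at x."""
    log_p = 0.0
    for xi, m, v in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
    return math.exp(log_p)


def fisher_vector(descriptors, weights, means, variances):
    """Fisher Vector: gradients w.r.t. means, then w.r.t. standard deviations.

    The result has length 2 * K * D for K components and D dimensions.
    """
    n = len(descriptors)
    k = len(weights)
    d = len(means[0])
    g_mu = [[0.0] * d for _ in range(k)]
    g_sigma = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        likes = [w * gaussian_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        if total == 0.0:
            continue  # descriptor too far from every component (underflow)
        for j in range(k):
            gamma = likes[j] / total  # soft assignment to component j
            for i in range(d):
                z = (x[i] - means[j][i]) / math.sqrt(variances[j][i])
                g_mu[j][i] += gamma * z
                g_sigma[j][i] += gamma * (z * z - 1.0)
    fv = []
    for j in range(k):
        fv.extend(g / (n * math.sqrt(weights[j])) for g in g_mu[j])
    for j in range(k):
        fv.extend(g / (n * math.sqrt(2.0 * weights[j])) for g in g_sigma[j])
    return fv


# Toy example: two 1-D components and two descriptors sitting on the means.
fv = fisher_vector([[-1.0], [1.0]],
                   weights=[0.5, 0.5],
                   means=[[-1.0], [1.0]],
                   variances=[[1.0], [1.0]])
```

With 256 components, as used in this paper, the vector length would be 2 × 256 × D for D-dimensional descriptors after PCA.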
3. Representation for Video
Motion part encoding has already been identified as a successful method to represent the video for action recognition. In this section, we use a graph clustering method to cluster similar trajectories into groups. The representation of an action video is then the concatenation of the low-level local descriptor encoding and the mid-level motion part encoding.
3.1. Trajectories Group
To better describe the motion, we cluster similar trajectories into groups, because only certain critical regions of the video are relevant to a specific action. Prior work computes a hierarchical clustering on trajectories to yield trajectory groups of action parts, and we apply the same efficient greedy agglomerative hierarchical clustering procedure to group the trajectories. A video contains a large number of trajectories; that is, the graph has a large number of nodes. Removing the edges between trajectories that are not spatially close yields a sparse trajectory graph. Greedy agglomerative hierarchical clustering is a fast, scalable algorithm with almost linear complexity in the number of nodes for relatively sparse trajectory graphs. To group the trajectories, we build an $n \times n$ trajectory distance matrix for a video containing $n$ trajectories. We use a distance metric between trajectories that takes their spatial and temporal relations into account. Given two trajectories $\tau_a$ and $\tau_b$, the metric is built from the $\ell_2$ distances of the trajectory points at corresponding time instances. We calculate the distance only over the frames in which $\tau_a$ and $\tau_b$ exist simultaneously. To ensure the spatial compactness of the estimated groups, we enforce the affinity to be zero for trajectory pairs that are not spatially close. The number of clusters in a video follows the setting of the prior work, and the number of trajectories in a cluster is kept below 100, an empirically chosen value.
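The construction of a sparse trajectory graph can be sketched as follows. The exact distance formula is not reproduced here, so this sketch makes loud assumptions: the distance is the mean Euclidean distance over the shared frames, the affinity is `exp(-distance)`, and `max_dist` is a hypothetical spatial threshold. The clustering step itself is not shown.

```python
import math


def trajectory_distance(traj_a, traj_b):
    """Mean l2 distance between two trajectories over their shared frames.

    Each trajectory is a dict mapping frame index -> (x, y).
    Returns None when the two trajectories never exist in the same frame.
    """
    shared = sorted(set(traj_a) & set(traj_b))
    if not shared:
        return None
    dists = [math.hypot(traj_a[t][0] - traj_b[t][0],
                        traj_a[t][1] - traj_b[t][1]) for t in shared]
    return sum(dists) / len(dists)


def affinity_matrix(trajectories, max_dist):
    """Symmetric affinity matrix, zero for pairs that are not spatially close."""
    n = len(trajectories)
    aff = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = trajectory_distance(trajectories[i], trajectories[j])
            if d is not None and d <= max_dist:
                aff[i][j] = aff[j][i] = math.exp(-d)  # assumed affinity form
    return aff


trajs = [
    {0: (0.0, 0.0), 1: (1.0, 0.0)},
    {1: (1.0, 3.0), 2: (2.0, 3.0)},          # overlaps the first in frame 1
    {0: (100.0, 100.0), 1: (101.0, 100.0)},  # far away from both
]
aff = affinity_matrix(trajs, max_dist=10.0)
```

Only the first two trajectories are connected in this example, so the resulting graph is sparse, which is what keeps agglomerative clustering fast on videos with many trajectories.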
3.2. Second Layer with VLAD
The trajectory groups that describe a motion part in the same action category are similar to one another. To capture the coarser spatiotemporal characteristics of the descriptors in a group $G$, we compute the mean of the group descriptors (ω-HOF, HOF, and MBH) and of the trajectory shapes. Then, we concatenate the mean descriptors (ω-HOF, HOF, and MBH) as $D_G$ and take the mean shape $S_G$, so the group is described as $(D_G, S_G)$. VLAD is a descriptor encoding technique that aggregates the descriptors based on a locality criterion in the feature space. The classic BOF represents a sample only by counts over the cluster centers, which loses a lot of information. In group encoding, we denote the code words in the group codebook as $c_1, \ldots, c_V$, and $X_i$ is the set of all group descriptors that belong to the $i$th word. The video is encoded as the vector
$$v = \Big[ \sum_{x \in X_1} (x - c_1), \; \ldots, \; \sum_{x \in X_V} (x - c_V) \Big],$$
where $V$ is the size of the codebook learned by $k$-means clustering. VLAD therefore keeps more information than BOF.
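A minimal VLAD sketch is given below. It assumes the codebook has already been learned, uses a plain linear search for the nearest code word, and omits any normalization of the final vector; the function name `vlad_encode` is hypothetical.

```python
def vlad_encode(descriptors, codebook):
    """Concatenate, per code word, the summed residuals of its descriptors."""
    dim = len(codebook[0])
    residuals = [[0.0] * dim for _ in codebook]
    for x in descriptors:
        # Nearest code word by squared Euclidean distance (linear search).
        nearest = min(
            range(len(codebook)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])),
        )
        for d in range(dim):
            residuals[nearest][d] += x[d] - codebook[nearest][d]
    return [value for word in residuals for value in word]


codebook = [(0.0, 0.0), (10.0, 10.0)]
groups = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0)]
video_vector = vlad_encode(groups, codebook)
# video_vector == [1.0, 2.0, -1.0, 0.0]
```

A BOF histogram for the same input would only record the counts (2, 1), whereas the residual sums also keep where the descriptors lie relative to each code word.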
3.3. Video Encoding
We encode each video from the group descriptors of the motion parts using the VLAD model. The codebook for each kind of group descriptor (ω-HOF, HOF, MBH, and trajectory shape) is constructed separately by $k$-means clustering. According to the average number of groups in every video, we set the number of visual words to 50. To find the nearest center quickly, we construct a KD-tree that is used when each group descriptor is mapped to the codebook. We describe the video encoding vector with the group model for the different descriptors. Then, the motion part model is encoded by concatenating the group VLAD encodings of the different descriptors. Finally, the representation for action recognition is the concatenation of the low-level local descriptor encoding and the mid-level motion part encoding. Figure 2 shows an overview of our pipeline for action recognition.
4. Experiments
In this section, we conduct experiments to evaluate the performance of the proposed action representation. We validate our model on several action recognition benchmarks and compare our results with different methods.
4.1. Datasets
We validate our model on three standard datasets for human action: KTH, J-HMDB, and YouTube. The KTH dataset shows actions in front of a uniform background, whereas the J-HMDB and YouTube datasets are collected from a variety of sources ranging from digitized movies to YouTube. They cover different scales and difficulty levels for action recognition. We summarize them and the experimental protocols as follows.
The KTH dataset contains 6 action categories: boxing, handclapping, hand waving, jogging, running, and walking. The background is homogeneous and static in most sequences. We follow the standard experimental setting, dividing the dataset into a training set and a test set. We train a multiclass classifier and report the average accuracy over all classes as the performance measure.
The J-HMDB dataset contains 21 action categories: brush hair, catch, clap, climb stairs, golf, jump, kick ball, pick, pour, pull-up, push, run, shoot ball, shoot bow, shoot gun, sit, stand, swing baseball, throw, walk, and wave. J-HMDB is a subset of HMDB51, which is collected from movies or the Internet. This dataset excludes categories from HMDB51 that contain facial expressions like smiling and interactions with others such as shaking hands, and it focuses on single body actions. We evaluate on the J-HMDB which contains 11 categories involving one single body action. For multiclass classification, we use the one-vs-rest approach.
The YouTube Action dataset contains 11 action categories: basketball, biking, diving, golf swinging, horse riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog. It is a challenging dataset because of the large variations in camera motion, appearance, and pose. Following prior work, we use leave-one-group-out cross-validation and report the average accuracy over all classes.
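The leave-one-group-out protocol can be sketched as below, assuming each video carries a group identifier. The function name `leave_one_group_out` and the toy group ids are hypothetical; training and scoring are left out.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Six videos spread over three groups.
group_ids = [1, 1, 2, 3, 3, 2]
splits = list(leave_one_group_out(group_ids))
```

Each group is used once as the test set while all remaining groups form the training set, and the per-class accuracies are then averaged over the folds.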
4.2. Experiment Result
The proposed method extracts one-scale ω-flow trajectory-based local features by tracking densely sampled interest points and then clusters the trajectories into groups to encode the motion parts.
To choose a discriminative combination of features for the low-level local descriptors, the first baseline experiment evaluates the low-level local descriptors based on ω-flow dense trajectories with Fisher Vector encoding. A GMM with 256 components is learned from a subset of 256,000 randomly selected trajectory-based local descriptors. A linear SVM is used as the classifier. We compare different feature descriptors in Figure 3, where the average accuracy on the J-HMDB dataset is reported. It can be seen that the MBH descriptors, which encode the relative motion between pixels, work better than the other descriptors. Figure 3 also shows that the combination of HOF, ω-HOF, and MBH descriptors achieves 67.4%, which is the highest accuracy among all the low-level local descriptors. We therefore use this combination in the second experiment.
In the second baseline experiment, the proposed two-layer representation is the concatenation of the low-level local descriptor encoding and the motion part descriptor encoding. Table 1 and Figure 4 compare the two-layer method with the low-level method on the J-HMDB and YouTube datasets. It can be seen that the two-layer model performs better than the low-level encoding for the different descriptors. In addition, we compare the proposed method with several classic methods on the KTH, J-HMDB, and YouTube datasets, such as DT + BoVW, mid-level parts, traditional FV, stacked FV, DT + BOW, and IDT + FV. As shown in Table 2, the two-layer model obtains 67.4% and 87.6% accuracy on the J-HMDB and YouTube datasets, respectively. The recognition accuracy is improved by 4.6% on the J-HMDB dataset and 2.2% on the YouTube dataset compared with other state-of-the-art methods. However, the gain of the proposed method on the KTH dataset is smaller than on the J-HMDB and YouTube datasets, because KTH was collected with a fixed camera against a homogeneous background, so the advantage of the ω-flow trajectories does not show in this case.
Figure 4: (a) J-HMDB dataset; (b) YouTube dataset.
5. Conclusion
This paper proposed a two-layer representation for action recognition based on local descriptors and motion part descriptors, which achieved an improvement over the low-level local descriptors alone. It not only uses low-level local information to encode the video but also combines it with the motion parts to represent the video, yielding a discriminative and compact representation for action recognition. However, there is still room for improvement. First, the proposed method cannot determine the number of groups for different datasets, although the number of groups strongly affects the performance of the mid-level encoding. Second, many groups in a video do not represent an action part, so a method is needed to learn the discriminative groups for a better representation of the video. In the future, we will investigate new group clustering methods that can find more discriminative groups of action parts.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Y. Wang, B. Wang, Y. Yu, Q. Dai, and Z. Tu, “Action-Gons: Action recognition with a discriminative dictionary of structured elements with varying granularity,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface, vol. 9007, pp. 259–274, 2015.
F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-scale image classification,” in Proceedings of the 11th European Conference on Computer Vision (ECCV '10), vol. 6314 of Lecture Notes in Computer Science, pp. 143–156, Crete, Greece, 2010.
Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, “Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11), pp. 3361–3368, June 2011.
X. Peng, C. Zou, Y. Qiao, and Q. Peng, “Action recognition with stacked fisher vectors,” in Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, vol. 8693 of Lecture Notes in Computer Science, pp. 581–595, Springer, Berlin, Germany, 2014.
N. Ikizler-Cinbis and S. Sclaroff, “Object, scene and actions: Combining multiple features for human action recognition,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface, vol. 6311, no. 1, pp. 494–507, 2010.
G. Cheng, Y. Wan, W. Santiteerakul, S. Tang, and B. P. Buckles, “Action recognition with temporal relationships,” in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2013, pp. 671–675, Portland, OR, USA, June 2013.