Abstract

Despite their high stability and compactness, chord-length shape features have received relatively little attention in the human action recognition literature. In this paper, we present a new approach to human activity recognition based on chord-length shape features. The contribution of this paper is twofold. First, we show how a compact, computationally efficient shape descriptor, the chord-length shape feature, is constructed from 1-D chord-length functions. Second, we show how fuzzy membership functions can be used to partition action snippets into a number of temporal states. On two benchmark action datasets (KTH and Weizmann), the approach yields promising results that compare favorably with those previously reported in the literature, while maintaining real-time performance.

1. Introduction

Recognizing human activities in video data is a paramount but challenging task in computer vision and image understanding. Efficient approaches and algorithms for human action/behavior recognition hold huge potential for a large number of applications, for example, human-computer interaction, video surveillance, gesture recognition, robot learning and control, and so forth. In practice, the non-rigid nature of the human body and clothing, together with drastic illumination changes, pose variations, and erratic motion patterns, presents a grand challenge to human detection and action recognition [1].

In addition, while real-time performance is a major concern in computer vision, especially for embedded computer vision systems, the majority of state-of-the-art action recognition systems employ sophisticated feature extraction and learning techniques, which stand in the way of real-time operation. This points to a tradeoff between accuracy and real-time requirements. The automatic recognition and understanding of human actions in video sequences are still underdeveloped owing to the lack of a general-purpose model, and most approaches proposed in the literature remain limited in their ability. Hence, much research still needs to be undertaken to address the ongoing challenges. The remainder of the paper is structured as follows. Section 2 reviews related work. In Section 3, the chord-length functions and chord-length features are described. Section 4 details the proposed action recognition method. Experimental results corroborating the efficiency of the proposed method are presented in Section 5. Finally, Section 6 concludes and outlines some prospects for future work.

2. Related Work

Over the course of the last couple of decades, a great deal of work has been done (and is still being done) on the recognition of human activities from both still images and video sequences. Despite these years of work, the problem is still open, remains a significant challenge for researchers, and calls for more rigorous research. Human actions can generally be recognized using various visual cues such as motion [1, 3–5] and shape [6–10]. Scanning the literature, one notices that a significant body of work in action recognition focuses on using spatio-temporal key points and local feature descriptors [11–15]. The local features are extracted from the region around each key point returned by the key point detector. These features are then quantized into a discrete set of visual words before being fed into the classification module. Another thread of research is concerned with analyzing patterns of motion to recognize human actions. For instance, in [3], periodic motions are detected and classified to recognize actions. In [5] the authors analyze the periodic structure of optical flow patterns for gait recognition. Alternatively, some researchers have opted to use both motion and shape cues. For example, in [16], Bobick and Davis use temporal templates, including motion-energy images and motion-history images, to recognize human movement. In [17] the authors detect the similarity between video segments using a space-time correlation model, while in [18], Rodriguez et al. present a template-based approach using a Maximum Average Correlation Height (MACH) filter to capture intraclass variabilities. Jhuang et al. [19] perform action recognition by building a neurobiological model using spatio-temporal gradients. In [20], actions are recognized by training different SVM classifiers on local shape and optical-flow features. In parallel, a significant amount of work is targeted at modelling and understanding human motions by constructing elaborate temporal dynamic models [21–24]. Finally, there is also an attractive area of research that concentrates on using generative topic models for visual recognition based on the so-called Bag-of-Words (BoW) model. The underlying concept of a BoW is that video sequences are represented by counting the number of occurrences of descriptor prototypes, so-called visual words. Topic models are built and then applied to the BoW representation. Three of the most popular topic models are Latent Dirichlet Allocation (LDA) [25], Correlated Topic Models (CTMs) [26], and probabilistic Latent Semantic Analysis (pLSA) [27].

3. Chord-Length Functions

A shape border, that is, a contour, is an inalienable property of every object and can be defined as a simply connected sequence of 2-D points C = {p_0, p_1, ..., p_{n-1}}, where p_n = p_0 as C is closed. The diameter of the shape boundary is given as D = max_{i,j} d(p_i, p_j), where d(p_i, p_j) is defined as the Euclidean distance between the two points p_i and p_j. Taking p_0 as an initial point q_0, let the contour be traversed anticlockwise and partitioned into N arc segments of equal length, with division points q_1, q_2, ..., q_{N-1}, where q_k is the k-th division point. Thus, we have N - 1 chords q_0 q_k and their lengths L_k = d(q_0, q_k), where L_k is the length of the k-th chord measured as the Euclidean distance between the two points q_0 and q_k, as shown in Figure 1. Now let us assume the point q_0 travels along the contour, parameterized by the arc length t; then the chord lengths vary accordingly. This implies that L_k is a function of t. Such a function is termed a chord-length function (CLF) and is denoted L_k(t) [31]. Therefore we obtain N - 1 CLFs, L_1(t), ..., L_{N-1}(t). As these functions are obtained by splitting the contour evenly and by moving the initial point q_0 along the contour, they are guaranteed to be invariant to translation and rotation. However, the chord length is not scale invariant, but it can be normalized by the contour diameter D to achieve scale invariance.
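To make the construction concrete, the following sketch computes the normalized CLFs of a closed contour with NumPy; the arc-length resampling step, the helper names, and the default values of N and the sampling density are illustrative choices, not specifications from the paper.

```python
import numpy as np

def resample_contour(contour, num_points):
    """Resample a closed 2-D contour to `num_points` points equally spaced in arc length."""
    contour = np.asarray(contour, dtype=float)
    closed = np.vstack([contour, contour[:1]])              # close the polygon
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)   # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])             # cumulative arc length
    targets = np.linspace(0.0, s[-1], num_points, endpoint=False)
    x = np.interp(targets, s, closed[:, 0])
    y = np.interp(targets, s, closed[:, 1])
    return np.stack([x, y], axis=1)

def chord_length_functions(contour, N=32, samples=256):
    """Return a (samples, N-1) array whose column k-1 is CLF_k evaluated at
    `samples` positions of the moving initial point q_0 along the contour."""
    pts = resample_contour(contour, samples)
    diameter = max(np.linalg.norm(pts[i] - pts[j])
                   for i in range(len(pts)) for j in range(i + 1, len(pts)))
    step = samples // N                                      # arc step between division points
    clfs = np.empty((samples, N - 1))
    for t in range(samples):                                 # move the initial point q_0
        for k in range(1, N):                                # k-th division point q_k
            qk = pts[(t + k * step) % samples]
            clfs[t, k - 1] = np.linalg.norm(pts[t] - qk)
    return clfs / diameter                                   # normalize by the diameter for scale invariance
```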

The CLFs apparently meet the key requirements for being a shape descriptor. We then scale all the CLFs to lie within the same range (e.g., [0, 1]). Because the CLFs are derived by segmenting the contour evenly, it is easy to deduce that only half of them, L_1(t), ..., L_{N/2}(t), are enough to describe the shape adequately. It is worthwhile here to point out that both global and local features of a shape can be captured by using chord lengths of different levels. The local features are likely to be captured by the CLFs of the partition points closer to the initial point q_0, while the global features are captured by those of the farther points. This is a distinct advantage of the CLFs over other shape descriptors.

4. Suggested Methodology

The framework of the proposed action recognition system is schematically illustrated in Figure 2. In the following subsections, the steps of the scheme are described in more detail.

4.1. Preprocessing and Background Subtraction

Preprocessing provides more meaningful input for the subsequent feature extraction and classification stages and thus helps improve the final recognition results. First, all the frames of each action snippet are smoothed using Gaussian convolution. Then the background is subtracted from each action snippet using a Mixture-of-Gaussians (MoG) background modeling technique. For background subtraction, a GMM background model analogous to that described in [32] is used. In this model, each pixel in the scene is modeled by a mixture of K Gaussian distributions. Thus the probability that a certain pixel has intensity X_t at time t is given by

P(X_t) = ∑_{i=1}^{K} ω_{i,t} η(X_t, μ_{i,t}, Σ_{i,t}),

where ω_{i,t}, μ_{i,t}, and Σ_{i,t} are the weight, the mean, and the covariance of the i-th distribution at time t, respectively, and η is the Gaussian probability density function

η(X, μ, Σ) = (2π)^{-n/2} |Σ|^{-1/2} exp(-(1/2)(X - μ)^T Σ^{-1} (X - μ)).

In this way, raw silhouettes are produced and then filtered. Finally, the shape borders representing all poses of a specific action are extracted from the filtered silhouettes. These preprocessing operations are summarized in Figure 3.
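As an illustration of this preprocessing chain, the sketch below uses OpenCV's MOG2 background subtractor as a stand-in for the Gaussian-mixture model of [32]; the smoothing and morphological-filtering parameters are assumptions rather than the authors' settings.

```python
import cv2

def extract_pose_contours(video_path):
    """Smooth each frame, subtract the background with a Gaussian-mixture model,
    filter the silhouette, and return the largest contour per frame."""
    cap = cv2.VideoCapture(video_path)
    bg = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16, detectShadows=False)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    contours_per_frame = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        smoothed = cv2.GaussianBlur(gray, (5, 5), 0)             # Gaussian convolution
        mask = bg.apply(smoothed)                                # MoG foreground mask
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # filter the raw silhouette
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        if contours:
            largest = max(contours, key=cv2.contourArea)         # person silhouette border
            contours_per_frame.append(largest.squeeze(1))
    cap.release()
    return contours_per_frame
```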

4.2. Feature Extraction

Initially, we divide each video sequence into several temporal states to compensate for time-warping effects. These states are defined by vague, linguistic intervals. Gaussian membership functions μ_s(t), s = 1, ..., S, each parameterized by a center c_s, a width σ_s, and a fuzzification factor, are used to describe the temporal intervals, where S is the total number of temporal states of the action. Note that the membership functions defined above are chosen to be of identical shape under the condition that their sum is equal to one at any instant of time, as shown in Figure 4. By using such fuzzy functions, not only can temporal information be easily extracted, but the performance decline due to time-warping effects can also be mitigated.
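The sketch below is one possible realization of such a fuzzy temporal partition: S Gaussian-shaped membership functions with evenly spaced centers are built over the clip length and renormalized column-wise so that they sum to one at every frame, as in Figure 4; the width formula and the role given to the fuzzification factor are assumptions.

```python
import numpy as np

def temporal_memberships(num_frames, num_states, fuzz=1.0):
    """Return a (num_states, num_frames) array of fuzzy memberships.

    Each row is a Gaussian-shaped membership function; `fuzz` plays the role of
    a fuzzification factor (larger values give flatter, more overlapping states).
    Columns are renormalized so the memberships sum to one at every frame.
    """
    t = np.arange(num_frames, dtype=float)
    centers = np.linspace(0, num_frames - 1, num_states)        # state centers c_s
    width = fuzz * num_frames / (2.0 * num_states)              # state width
    mu = np.exp(-0.5 * ((t[None, :] - centers[:, None]) / width) ** 2)
    return mu / mu.sum(axis=0, keepdims=True)                   # partition of unity

# Example: a 50-frame snippet divided into 5 fuzzy temporal states
memberships = temporal_memberships(num_frames=50, num_states=5)
assert np.allclose(memberships.sum(axis=0), 1.0)
```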

4.2.1. Chord-Length Shape Features

As shown previously in Section 3, given a shape, CLFs can be defined by dividing the shape border into arcs of equal length. These functions are invariant to translation, rotation, and scaling. However, like other function-based shape descriptors, they are not sufficiently compact. Additionally, they depend on the reference point at which the parameterization of the shape border starts; because the contour is closed, any point on the contour can be used as a reference point, and the CLFs change accordingly. In order to avoid these problems and for convenience, the mean and variance of each CLF are adopted. Hence, the CLF descriptor of a shape can be expressed as

f = (m_1, v_1, m_2, v_2, ..., m_{N/2}, v_{N/2}),

where m_k and v_k are the mean and variance of L_k(t). In order to obtain the CLF shape descriptor of a given action, we first obtain the CLF descriptor for all poses of this action. As each action snippet is temporally divided into a number of fuzzy states representing poses of the action, the CLF descriptor of an action pose s is obtained by

F_s = (1/T_s) ∑_t μ_s(t) f(t),

where f(t) and T_s are the CLF shape descriptor at time t and the length of the temporal state, respectively. Accordingly, the final CLF descriptor of the action is constructed by concatenating the CLF shape descriptors of all its temporal poses. The resulting feature vectors (i.e., CLF descriptors) are then normalized to sum to unity. The normalized feature vectors can be exploited as shape descriptors for classification and matching. Generally, many approaches in computer vision directly combine such normalized vectors to obtain a single feature vector per video clip, which in turn can be classified by any machine learning algorithm (SVM, ANN, naive Bayes, decision trees, etc.). In contrast, in this work we aim to enrich these vectors through self-similarity analysis. This is paramount for improving the ability to discriminate between the temporal variations of different human actions.
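A minimal sketch of this feature-extraction step is given below; the membership-weighted averaging is the aggregation scheme assumed above, and the inputs are the CLF array and membership matrix produced by the earlier sketches.

```python
import numpy as np

def clf_descriptor(clfs):
    """Frame-level descriptor: mean and variance of each chord-length function.

    `clfs` is the (samples, N-1) array of CLF values for one frame's contour;
    only the first half of the columns is kept, as in Section 3.
    """
    half = clfs[:, : clfs.shape[1] // 2]
    return np.concatenate([half.mean(axis=0), half.var(axis=0)])

def action_descriptor(per_frame_descriptors, memberships):
    """Aggregate frame descriptors into one descriptor per fuzzy temporal state
    and concatenate the state descriptors (an assumed weighted-average scheme)."""
    F = np.asarray(per_frame_descriptors, dtype=float)   # shape (num_frames, dim)
    states = []
    for mu in memberships:                               # one membership curve per state
        states.append((mu[:, None] * F).sum(axis=0) / mu.sum())
    vec = np.concatenate(states)
    return vec / vec.sum()                               # normalize to unit sum
```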

4.2.2. Temporal Self-Similarities Construction

For comparing the similarity between two vectors, one can adopt any of several metrics (Euclidean, cosine, Mahalanobis, etc.). While such metrics have intrinsic merit, they are of limited use in our approach because we care more about the overall shape of the temporal profiles than about their actual magnitudes, which is the main concern in applications such as action recognition. Therefore, we use a different similarity metric in which trends and relative changes are considered. This metric is based on the Pearson Linear Correlation (PLC), where r(x, y) is the PLC between the two vectors x = (x_1, ..., x_n) and y = (y_1, ..., y_n), defined as

r(x, y) = ∑_{i=1}^{n} (x_i - m_x)(y_i - m_y) / sqrt(∑_{i=1}^{n} (x_i - m_x)^2 · ∑_{i=1}^{n} (y_i - m_y)^2).

The means m_x and m_y of x and y are given by m_x = (1/n) ∑_{i=1}^{n} x_i and m_y = (1/n) ∑_{i=1}^{n} y_i.
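The PLC can be transcribed directly; as a quick sanity check, it agrees with NumPy's built-in np.corrcoef.

```python
import numpy as np

def pearson_similarity(x, y):
    """Pearson linear correlation between two feature vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()                 # center both vectors
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Equivalent one-liner: np.corrcoef(x, y)[0, 1]
```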

Given a set of feature vectors F_1, ..., F_S that represent the poses (or temporal states) of an action, the temporal self-similarity matrix of the action is given as D = [d_{jk}], where d_{jk} = 1 - r(F_j, F_k). The main diagonal elements are zero because r(F_j, F_j) = 1. Meanwhile, because r(F_j, F_k) = r(F_k, F_j), D is a symmetric matrix. It is important to point out that the self-similarity matrix achieves the goal of reducing the dimensionality of the feature space, since only its S(S - 1)/2 upper-diagonal elements need to be retained, without losing the relevant temporal information. For the present work, various values of S were tried, and the value that gave the best results was adopted.
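Assuming, as above, that the matrix entries are the correlation distances d_jk = 1 - r(F_j, F_k), the self-similarity matrix and its upper-diagonal scan can be computed as follows.

```python
import numpy as np

def self_similarity_matrix(state_descriptors):
    """Build the S x S matrix D with D[j, k] = 1 - r(F_j, F_k)."""
    F = np.asarray(state_descriptors, dtype=float)
    S = len(F)
    D = np.zeros((S, S))
    for j in range(S):
        for k in range(j + 1, S):
            r = np.corrcoef(F[j], F[k])[0, 1]    # Pearson linear correlation
            D[j, k] = D[k, j] = 1.0 - r          # symmetric, zero diagonal
    return D

def upper_diagonal(D):
    """Return the S(S - 1)/2 entries above the main diagonal in scan order."""
    return D[np.triu_indices_from(D, k=1)]
```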

4.3. Fusing Motion Features with Shape Features

Global motion features have proven to be advantageous in many object recognition applications. This encourages us to extend the idea and fuse motion features with the CLF features to form the final SVM model. The motion features extracted here are based on calculating the center of gravity (i.e., the shape centroid) shown in Figure 5, which delivers the center of motion and is given by c(t) = (x_c(t), y_c(t)), where the spatial coordinates of the centroid are

x_c(t) = (1/n) ∑_{i=1}^{n} x_i(t),  y_c(t) = (1/n) ∑_{i=1}^{n} y_i(t),

with (x_i(t), y_i(t)), i = 1, ..., n, being the silhouette points at time t. Such features carry profound information, not only about the type of motion (e.g., translational or oscillatory), but also about the rate of motion (i.e., velocity). With these features, it is possible to distinguish, for example, between an action where motion occurs over a relatively large area (e.g., running) and an action localized in a smaller region, where only small parts of the body are in motion (e.g., boxing). It is worth mentioning that fusing motion information with the local features was very beneficial for the current action recognition task and yielded a marked improvement in recognition accuracy.
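One way to realize these motion features is to track the silhouette centroid over time and summarize its displacement; the specific statistics chosen below (mean speed, speed variability, and the spatial spread of the trajectory) are illustrative, not the paper's exact feature set.

```python
import numpy as np

def centroid_motion_features(contours_per_frame):
    """Track the silhouette centroid c(t) = (x_c(t), y_c(t)) and summarize its motion."""
    centroids = np.array([c.mean(axis=0) for c in contours_per_frame])   # shape (T, 2)
    velocity = np.diff(centroids, axis=0)                 # frame-to-frame displacement
    speed = np.linalg.norm(velocity, axis=1)
    return np.array([
        speed.mean(),                                     # average rate of motion
        speed.std(),                                      # regularity of the motion
        centroids[:, 0].std(),                            # horizontal extent of the trajectory
        centroids[:, 1].std(),                            # vertical extent of the trajectory
    ])
```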

4.4. Action Classification Using SVM

In this section, we formulate the action recognition task as a multiclass learning problem, where there is one class for each action, and the goal is to assign an action to an individual in each video sequence. There are various supervised learning algorithms by which an action recognizer can be trained.

Support Vector Machines (SVMs) are used in our framework due to their outstanding generalization capability and their reputation as a highly accurate paradigm. SVMs [38] are based on the Structural Risk Minimization principle from computational learning theory and offer a principled remedy to the overfitting encountered with neural networks. Originally, SVMs were designed to handle dichotomic classes in a higher-dimensional space where a maximal separating hyperplane is created. On each side of this hyperplane, two parallel hyperplanes are constructed. The SVM then attempts to find the separating hyperplane that maximizes the distance between these two parallel hyperplanes. Intuitively, a good separation is achieved by the hyperplane having the largest distance (see Figure 6). Hence, the larger the margin is, the lower the generalization error of the classifier will be. More formally, let {(x_i, y_i)}_{i=1}^{n}, with y_i in {-1, +1}, be a training dataset. Cortes and Vapnik stated in their paper [38] that this problem is best addressed by allowing some examples to violate the margin constraints. These potential violations are formulated using positive slack variables ξ_i and a penalty parameter C that penalizes the margin violations. Thus the optimal separating hyperplane is determined by solving the following primal quadratic programming (QP) problem:

minimize over (w, b, ξ):  (1/2) ||w||^2 + C ∑_{i=1}^{n} ξ_i
subject to:  y_i (w · x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0,  i = 1, ..., n.

Geometrically, w is a vector perpendicular to the separating hyperplane. The offset parameter b is added so that the hyperplane is not forced to pass through the origin, which would restrict the solution. For computational purposes it is more convenient to solve the SVM in its dual formulation. This can be accomplished by forming the Lagrangian and then optimizing over the Lagrange multipliers α_i. The resulting decision function has weight vector w = ∑_i α_i y_i x_i. The instances with α_i > 0 are called support vectors, as they uniquely define the maximum-margin hyperplane.
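For completeness, the dual problem obtained from the Lagrangian of the primal above takes the standard soft-margin form of Cortes and Vapnik; it is reproduced here from the textbook formulation rather than from the paper.

```latex
\max_{\alpha}\;\sum_{i=1}^{n}\alpha_i
  - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
      \alpha_i\alpha_j\, y_i y_j\, \mathbf{x}_i^{\top}\mathbf{x}_j
\qquad \text{subject to} \qquad
0 \le \alpha_i \le C,\quad i = 1,\dots,n, \qquad \sum_{i=1}^{n}\alpha_i y_i = 0 .
```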

In this approach, several classes of actions are created, and several one-versus-all SVM classifiers are trained using the features extracted from the action snippets in the training dataset. The upper-diagonal elements of the temporal self-similarity matrix, representing the shape features, are first transformed into plain vectors in element scan order. The motion features are then concatenated with the shape features to generate the final hybrid feature vectors, whose dimension equals S(S - 1)/2 plus the number of motion features. Finally, these hybrid feature vectors are fed into the SVM classifiers for the final decision.
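A compact sketch of this classification stage with scikit-learn is shown below; OneVsRestClassifier realizes the one-versus-all scheme, and the hybrid feature assembly mirrors the description above (the helper names and the C value are assumptions).

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def hybrid_feature_vector(ssm, motion_features):
    """Concatenate the upper-diagonal shape similarities with the motion features."""
    shape_part = ssm[np.triu_indices_from(ssm, k=1)]    # S(S-1)/2 values in scan order
    return np.concatenate([shape_part, motion_features])

def make_action_classifier():
    """One-versus-all SVMs with a Gaussian RBF kernel (C and gamma are assumed values)."""
    return make_pipeline(StandardScaler(),
                         OneVsRestClassifier(SVC(kernel="rbf", C=10.0, gamma="scale")))

# Usage: X = np.stack([hybrid_feature_vector(ssm, mf) for ssm, mf in snippet_features])
#        clf = make_action_classifier().fit(X, labels)
```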

5. Experiments and Results

In this section, the experiments conducted to assess the performance of the proposed approach are described and some of their results are presented. In addition, in order to demonstrate the effectiveness of the proposed method, the obtained results are compared with those reported in the current literature. Two main experiments were carried out to evaluate this approach. The first was carried out on the publicly available benchmark KTH action dataset [39], while the second was conducted on the popular Weizmann action dataset [2].

5.1. Experiment 1

The KTH action dataset contains six types of human actions (i.e., walking, jogging, running, boxing, hand waving, and hand clapping), performed repeatedly by 25 individuals under four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3), and indoors (s4). Typical example frames of the six action categories in the KTH dataset can be seen in Figure 7. In order to prepare the experiments and to provide an unbiased estimate of the generalization abilities of the classification process, the sequences of each action were partitioned into two independent subsets, that is, a training set and a test set. More specifically, the sequences performed by 18 subjects were used for training, and the remaining sequences, performed by the other 7 subjects, were set aside as a test set. SVMs with a Gaussian radial basis function (RBF) kernel are trained on the training set, while the recognition performance is evaluated on the test set. The confusion matrix showing the recognition results achieved on the KTH action dataset is given in Table 1, while a comparison of the obtained results with those of other methods available in the literature is shown in Table 2.
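A sketch of this subject-wise evaluation protocol is shown below, assuming one hybrid feature vector, one action label, and one subject id per snippet; the classifier factory refers to the scikit-learn pipeline sketched in Section 4.4.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate_subject_split(X, y, subjects, train_subjects, make_classifier):
    """Train on the snippets of `train_subjects` and test on the remaining subjects."""
    X, y, subjects = np.asarray(X), np.asarray(y), np.asarray(subjects)
    train_mask = np.isin(subjects, train_subjects)
    clf = make_classifier()                               # e.g., the RBF-SVM pipeline above
    clf.fit(X[train_mask], y[train_mask])
    predictions = clf.predict(X[~train_mask])
    return (accuracy_score(y[~train_mask], predictions),
            confusion_matrix(y[~train_mask], predictions))
```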

As follows from the figures tabulated in Table 1, most actions are correctly classified. Furthermore, there is a clear distinction between arm actions and leg actions. Most of the confusions occur between the “jogging” and “running” actions and between the “boxing” and “clapping” actions. This is intuitively plausible given the high similarity within each of these pairs of actions. From the comparison given in Table 2, it turns out that our method performs competitively with other state-of-the-art methods and its results compare favorably with previously published results.

5.2. Experiment 2

The Weizmann action dataset was first provided by Blank et al. [2] in 2005 and contains a total of 90 video clips (i.e., 5098 frames) performed by 9 individuals. Each video clip contains one person performing an action. There are 10 categories of actions in the dataset, namely, walking, running, jumping, jumping in place, bending, jumping jacks, skipping, galloping sideways, one-hand waving, and two-hand waving. All the clips in the dataset are sampled at 25 Hz and last about 2 seconds, with an image frame size of 180 x 144 pixels. A sample frame for each action in the Weizmann dataset is illustrated in Figure 8. In order to provide an unbiased estimate of the generalization abilities of the proposed method, we have used the leave-one-out cross-validation (LOOCV) technique in the validation process. As the name suggests, this involves using the group of sequences from a single subject in the original dataset as the testing data and the remaining sequences as the training data. This is repeated so that each group of sequences in the dataset is used once for validation. More specifically, in each fold the sequences of 8 subjects were used for training and the sequences of the remaining subject were used for validation. Again, as in the first experiment, SVMs with a Gaussian RBF kernel are trained on the training set, while the recognition performance is evaluated on the test set.
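The leave-one-subject-out protocol can be written compactly with scikit-learn's LeaveOneGroupOut splitter, assuming the same per-snippet features, labels, and subject ids as before.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_subject_out(X, y, subjects, make_classifier):
    """Each subject's sequences serve exactly once as the validation set."""
    X, y, subjects = np.asarray(X), np.asarray(y), np.asarray(subjects)
    fold_scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
        clf = make_classifier()
        clf.fit(X[train_idx], y[train_idx])
        fold_scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    return float(np.mean(fold_scores))                   # average recognition rate
```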

The recognition results obtained by the proposed method are summarized in a confusion matrix in Table 3, where the correct responses lie on the main diagonal. A number of points can be drawn from the figures in the matrix. The majority of actions are correctly classified; an average recognition rate of 97.8% is achieved with our proposed method. What is more, there is a clear distinction between arm actions and leg actions. The only confusions occur between the skip and jump actions and between the jump and run actions. This intuitively seems reasonable because of the high similarity between the actions in each of these pairs. In order to quantify the effectiveness of the proposed method, the results obtained are compared with those obtained previously by other investigators. The outcome of this comparison is presented in Table 4. In light of this comparison, we can see that the proposed method is competitive with state-of-the-art methods. It is important to mention that all the methods [11, 34–37] with which we have compared our method, except the method proposed in [33], used similar experimental setups, so the comparison is meaningful and fair. A final remark is that this approach is able to run at about 28 fps (on a 2.8 GHz Intel dual-core machine with 4 GB of RAM). Therefore, it can meet the timing requirements of real-time applications and embedded systems.

6. Conclusion and Future Work

In this paper, we have introduced an approach for human activity recognition based on CLF shape features. On two benchmark action datasets, the results achieved by the approach demonstrate that it yields significant improvements in recognition accuracy and efficiency and remains competitive with existing state-of-the-art approaches. It would also be advantageous to validate the approach empirically on more realistic datasets that present further technical challenges, such as object articulation, occlusion, and significant background clutter. These issues are crucial and will be investigated more thoroughly in future work.

Acknowledgments

This work is supported by Transregional Collaborative Research Centre SFB/TRR 62 “Companion-Technology for Cognitive Technical Systems” funded by DFG and BMBF Bernstein-Group (FKZ: 01GQ0702). The authors would also like to thank the anonymous reviewers for their constructive comments and insightful suggestions made on an earlier version of the paper that greatly contributed to improving the quality of this work.