Abstract

We propose an approach for human activity recognition based on an affine-invariant shape representation and SVM-based feature classification. In this approach, a compact, computationally efficient, affine-invariant representation of action shapes is developed using affine moment invariants. Dynamic affine invariants are derived from the 3D spatiotemporal action volume and from the average image created from that volume, and are then classified by an SVM classifier. On two standard benchmark action datasets (the KTH and Weizmann datasets), the approach yields promising results that compare favorably with those previously reported in the literature, while maintaining real-time performance.

1. Introduction

Visual recognition and interpretation of human-induced actions and events are among the most active research areas in the computer vision, pattern recognition, and image understanding communities [1]. Although a great deal of progress has been made in the automatic recognition of human actions during the last two decades, the approaches proposed in the literature remain limited in their ability, and much research is still needed to address the ongoing challenges and to develop more efficient approaches. Good algorithms for human action recognition would benefit a large number of potential applications, for example, the search and structuring of large video archives, human-computer interaction, video surveillance, gesture recognition, and robot learning and control. In fact, the nonrigid nature of the human body and clothing, together with drastic illumination changes, pose variations, and erratic motion patterns in video sequences, presents a grand challenge to human detection and action recognition. In addition, while real-time performance is a major concern in computer vision, especially for embedded vision systems, the majority of state-of-the-art human action recognition systems employ sophisticated feature extraction and learning techniques, creating a barrier to real-time performance; this suggests a trade-off between accuracy and real-time performance. The remainder of this paper commences by briefly reviewing the most relevant literature in the area of human action recognition in Section 2. Then, in Section 3, we describe the details of the proposed method for action recognition. The experimental results corroborating the effectiveness of the proposed method are presented and analyzed in Section 4. Finally, in Section 5, we conclude and outline possible future work.

2. The Literature Overview

Recent years have witnessed a resurgence of interest in the analysis and interpretation of human motion, motivated by rising security concerns and the increased ubiquity and affordability of digital media production equipment. Human action can generally be recognized using various visual cues such as motion [2, 3] and shape [4, 5]. Scanning the literature, one notices that a significant body of work in human action recognition focuses on spatial-temporal key points and local feature descriptors [6]. Local features are extracted from the region around each key point returned by the key point detector; these features are then quantized into a discrete set of visual words before being fed into the classification module. Another thread of research is concerned with analyzing patterns of motion to recognize human actions. For instance, in [7], periodic motions are detected and classified to recognize actions. Alternatively, some researchers have opted to use both motion and shape cues. In [8], the authors detect the similarity between video segments using a space-time correlation model. In [9], Rodriguez et al. present a template-based approach using a Maximum Average Correlation Height (MACH) filter to capture intraclass variabilities. Likewise, a significant amount of work is targeted at modelling and understanding human motions by constructing elaborate temporal dynamic models [10]. Another active area of research focuses on generative topic models for visual recognition based on the so-called Bag-of-Words (BoW) model [11]. The underlying concept of BoW is that each video sequence is represented by counting the number of occurrences of descriptor prototypes, the so-called visual words. Topic models are built and then applied to the BoW representation; commonly used topic models include Correlated Topic Models (CTMs) [11], Latent Dirichlet Allocation (LDA) [12], and probabilistic Latent Semantic Analysis (pLSA) [13].

3. Proposed Methodology

In this section, the proposed method for action recognition is described. The main steps of the framework are explained in detail in the following subsections.

3.1. Background Subtraction

In this paper, we use a Gaussian Mixture Model (GMM) as the basis for modeling the background distribution. Formally speaking, let $X_t$ be the value of a pixel in the current frame, where $t$ is the frame index. Each pixel can then be modeled separately by a mixture of $K$ Gaussians:
$$P(X_t) = \sum_{i=1}^{K} \omega_{i,t}\, \eta\big(X_t; \mu_{i,t}, \Sigma_{i,t}\big),$$
where $\eta(\cdot)$ is a Gaussian probability density function, and $\mu_{i,t}$, $\Sigma_{i,t}$, and $\omega_{i,t}$ are the mean, covariance, and estimated weight of the $i$th Gaussian in the mixture at time $t$, respectively. $K$ is the number of distributions, which is set to 5 in our experiments. Before the foreground is detected, the background model is updated (see [14] for details of the updating procedure). After the updates are done, the weights $\omega_{i,t}$ are normalized and the components are ranked so that the distributions with high weight and low variance remain on top. By applying a threshold $T$ (set to 0.6 in our experiments), the first $B$ ranked distributions are retained as the background model, where
$$B = \arg\min_{b}\left(\sum_{k=1}^{b} \omega_{k} > T\right).$$

Finally, all pixels that match none of the components are good candidates to be marked as foreground. An example of GMM background subtraction can be seen in Figure 1.
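For concreteness, the following is a minimal Python sketch of this background-subtraction step using OpenCV's MOG2 subtractor (a GMM-based implementation comparable in spirit to the model described above, not our exact code); the input file name and the history/variance-threshold values are illustrative placeholders, while the mixture size and background-weight threshold follow the settings given above.

```python
import cv2

# GMM-based background subtraction via OpenCV's MOG2 subtractor.
# history/varThreshold are illustrative defaults, not tuned settings.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=False)
subtractor.setNMixtures(5)          # K = 5 Gaussians per pixel
subtractor.setBackgroundRatio(0.6)  # background-weight threshold T = 0.6

cap = cv2.VideoCapture("action_sequence.avi")  # hypothetical input sequence
silhouettes = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)     # 0 = background, 255 = foreground
    fg_mask = cv2.medianBlur(fg_mask, 5)  # suppress isolated noise pixels
    silhouettes.append(fg_mask)
cap.release()
```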

3.2. Average Images from 3D Action Volumes

The 3D volume in the spatio-temporal $(x, y, t)$ domain is formed by piling up the target region in the image sequence over one action cycle, which is the unit used to partition a sequence into spatio-temporal volumes. An action cycle is the fundamental unit used to describe an action. In this work, we assume that the spatio-temporal volume consists of a number of small voxels. The average image is defined as
$$\bar{I}(x, y) = \frac{1}{T}\sum_{t=1}^{T} V(x, y, t),$$
where $T$ is the number of frames in the action cycle (fixed in our experiments) and $V(x, y, t)$ represents the density of the voxels at time $t$. An example of the average image created from the 3D spatio-temporal volume of the running sequence is shown in Figure 2. For characterizing these 2D average images, the 2D affine moment invariants are used as features [26].
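A minimal sketch of this step, assuming the binary silhouettes of one action cycle have already been extracted as in Section 3.1, is given below; the averaging is a direct transcription of the definition above, and the cycle length is simply the number of masks passed in.

```python
import numpy as np

def average_image(silhouettes):
    """Average image of a 3D spatio-temporal action volume.

    `silhouettes` is a list of T binary foreground masks (H x W arrays)
    covering one action cycle; stacking them gives the volume V(x, y, t),
    and averaging over t yields the 2D average image.
    """
    volume = np.stack(silhouettes, axis=0).astype(np.float32)  # shape (T, H, W)
    return volume.mean(axis=0)                                 # shape (H, W)
```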

3.3. Feature Extraction

As is well known, moments describe the shape properties of an object as it appears. Affine moment invariants are moment-based descriptors that are invariant under a general affine transformation. They are built from the central moments
$$\mu_{pq} = \sum_{x}\sum_{y} (x - \bar{x})^{p} (y - \bar{y})^{q}\, I(x, y),$$
where $(\bar{x}, \bar{y})$ is the centroid of the object and $(p+q)$ is the order of the moment. Six affine moment invariants can be conventionally derived from these central moments [27]; the first two, for example, are
$$I_{1} = \frac{1}{\mu_{00}^{4}}\left(\mu_{20}\mu_{02} - \mu_{11}^{2}\right),$$
$$I_{2} = \frac{1}{\mu_{00}^{10}}\left(\mu_{30}^{2}\mu_{03}^{2} - 6\mu_{30}\mu_{21}\mu_{12}\mu_{03} + 4\mu_{30}\mu_{12}^{3} + 4\mu_{03}\mu_{21}^{3} - 3\mu_{21}^{2}\mu_{12}^{2}\right),$$
and the remaining four are given in full in [27].
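The central moments and the first two invariants above can be computed directly from an average image; the following sketch is a straightforward transcription of the formulas rather than an optimized implementation.

```python
import numpy as np

def central_moment(img, p, q):
    """Central moment mu_pq of a grayscale or binary image."""
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00
    return ((x - xc) ** p * (y - yc) ** q * img).sum()

def affine_invariants_2d(img):
    """First two 2D affine moment invariants (I1, I2)."""
    mu = lambda p, q: central_moment(img, p, q)
    mu00 = mu(0, 0)
    i1 = (mu(2, 0) * mu(0, 2) - mu(1, 1) ** 2) / mu00 ** 4
    i2 = (mu(3, 0) ** 2 * mu(0, 3) ** 2
          - 6 * mu(3, 0) * mu(2, 1) * mu(1, 2) * mu(0, 3)
          + 4 * mu(3, 0) * mu(1, 2) ** 3
          + 4 * mu(0, 3) * mu(2, 1) ** 3
          - 3 * mu(2, 1) ** 2 * mu(1, 2) ** 2) / mu00 ** 10
    return np.array([i1, i2])
```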

For the spatio-temporal $(x, y, t)$ space, the 3D central moment of order $(p+q+r)$ of a 3D object $O$ is derived using the same procedure as the 2D centralized moment:
$$\mu_{pqr} = \sum_{(x, y, t) \in O} (x - \bar{x})^{p} (y - \bar{y})^{q} (t - \bar{t})^{r},$$
where $(\bar{x}, \bar{y}, \bar{t})$ is the centroid of the object $O$ in the spatio-temporal space. Based on this definition of the 3D moment, six 3D affine moment invariants can be defined. The first of these moment invariants, for example, is
$$I_{1}^{3D} = \frac{1}{\mu_{000}^{5}}\left(\mu_{200}\mu_{020}\mu_{002} + 2\mu_{110}\mu_{101}\mu_{011} - \mu_{200}\mu_{011}^{2} - \mu_{020}\mu_{101}^{2} - \mu_{002}\mu_{110}^{2}\right).$$
Due to their long formulae, the remaining moment invariants are not displayed here (refer to [28]). Figure 3 shows a series of plots of 2D dynamic affine invariants for different action classes, computed on the average images of the action sequences.
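For completeness, the 3D central moments and the first 3D invariant defined above can be computed from the stacked silhouette volume in the same way; the sketch below assumes the volume is indexed as (t, y, x) and follows the definitions given above.

```python
import numpy as np

def central_moment_3d(volume, p, q, r):
    """Central moment mu_pqr of a spatio-temporal volume indexed as (t, y, x)."""
    t, y, x = np.mgrid[:volume.shape[0], :volume.shape[1], :volume.shape[2]]
    m000 = volume.sum()
    xc = (x * volume).sum() / m000
    yc = (y * volume).sum() / m000
    tc = (t * volume).sum() / m000
    return ((x - xc) ** p * (y - yc) ** q * (t - tc) ** r * volume).sum()

def first_affine_invariant_3d(volume):
    """First 3D affine moment invariant of the action volume."""
    mu = lambda p, q, r: central_moment_3d(volume, p, q, r)
    num = (mu(2, 0, 0) * mu(0, 2, 0) * mu(0, 0, 2)
           + 2 * mu(1, 1, 0) * mu(1, 0, 1) * mu(0, 1, 1)
           - mu(2, 0, 0) * mu(0, 1, 1) ** 2
           - mu(0, 2, 0) * mu(1, 0, 1) ** 2
           - mu(0, 0, 2) * mu(1, 1, 0) ** 2)
    return num / mu(0, 0, 0) ** 5
```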

3.4. Action Classification Using SVM

In this section, we formulate the action recognition task as a multiclass learning problem: there is one class for each action, and the goal is to assign an action label to the individual in each video sequence [1, 29]. Various supervised learning algorithms could be used to train the action recognizer. Support Vector Machines (SVMs) are used in this work because of their outstanding generalization capability and their reputation as a highly accurate paradigm [30]. SVMs, which offer a principled remedy for the overfitting observed in neural networks, are based on the structural risk minimization principle from statistical learning theory. Originally, SVMs were designed to handle dichotomic (two-class) problems: in a high-dimensional feature space, a maximal separating hyperplane is created, with two parallel hyperplanes constructed on each side of it. The SVM then seeks the separating hyperplane that maximizes the distance between these two parallel hyperplanes (see Figure 4). Intuitively, a good separation is achieved by the hyperplane with the largest margin; hence, the larger the margin, the lower the generalization error of the classifier. Formally, let $\{(\mathbf{x}_{i}, y_{i})\}_{i=1}^{N}$, with $\mathbf{x}_{i} \in \mathbb{R}^{d}$ and $y_{i} \in \{-1, +1\}$, be a training dataset. Vapnik [30] shows that the problem is best addressed by allowing some examples to violate the margin constraints. These potential violations are formulated with positive slack variables $\xi_{i}$ and a penalty parameter $C$ that penalizes margin violations. Thus, the generalized optimal separating hyperplane is determined by solving the following quadratic programming problem:
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{N}\xi_{i}$$
subject to $y_{i}(\mathbf{w} \cdot \mathbf{x}_{i} + b) \geq 1 - \xi_{i}$ and $\xi_{i} \geq 0$ for $i = 1, \ldots, N$.

Geometrically, $\mathbf{w}$ is a vector perpendicular to the separating hyperplane. The offset parameter $b$ allows the margin to increase and avoids forcing the hyperplane to pass through the origin, which would restrict the solution. For computational purposes, it is more convenient to solve the SVM in its dual formulation. This is accomplished by forming the Lagrangian and then optimizing over the Lagrange multipliers $\alpha_{i}$. The resulting decision function has weight vector $\mathbf{w} = \sum_{i} \alpha_{i} y_{i} \mathbf{x}_{i}$, with $0 \leq \alpha_{i} \leq C$. The training instances with $\alpha_{i} > 0$ are called support vectors, as they uniquely define the maximum-margin hyperplane.

In the current approach, one class is created for each action, and several one-versus-all SVM classifiers are trained using the affine moment features extracted from the action sequences in the training dataset. For each action sequence, a set of six 2D affine moment invariants is extracted from the average image, and another set of six 3D affine moment invariants is extracted from the spatio-temporal silhouette volume. SVM classifiers are then trained on these features to learn the various categories of actions.
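To illustrate the training stage, a minimal scikit-learn sketch is given below (not our actual implementation); it assumes the twelve invariants per sequence have already been stacked into feature matrices stored in hypothetical .npy files, and the C and gamma values are placeholders rather than tuned settings.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical files holding the (n_sequences, 12) feature matrices
# (six 2D + six 3D affine moment invariants) and the action labels.
X_train, y_train = np.load("train_feats.npy"), np.load("train_labels.npy")
X_test, y_test = np.load("test_feats.npy"), np.load("test_labels.npy")

# One-versus-all SVMs with an RBF kernel; C and gamma are illustrative.
clf = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(SVC(kernel="rbf", C=10.0, gamma="scale")),
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```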

4. Experiments and Results

To evaluate the proposed approach, two main experiments were carried out, and the results we achieved were compared with those reported by other state-of-the-art methods.

4.1. Experiment 1

We conducted this experiment on the KTH action dataset [31]. To illustrate the effectiveness of the method, the obtained results are compared with those of other similar state-of-the-art methods. The KTH dataset contains video sequences of six types of human actions (i.e., walking, jogging, running, boxing, hand waving, and hand clapping), performed by a total of 25 individuals in four different settings (i.e., outdoors, outdoors with scale variation, outdoors with different clothes, and indoors). All sequences were acquired by a static camera at 25 fps and a spatial resolution of 160 × 120 pixels over homogeneous backgrounds. To the best of our knowledge, no other similar dataset of sequences acquired in such varied environments is available in the literature. In order to prepare the experiments and to provide an unbiased estimate of the generalization ability of the classification process, the sequences performed by 18 subjects were used for training, and the sequences performed by the remaining 7 subjects were set aside as a test set. SVMs with a Gaussian radial basis function (RBF) kernel are trained on the training set, while the recognition performance is evaluated on the test set.
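The subject-wise split described above amounts to a few lines of bookkeeping; in the sketch below, the file names and the particular choice of 18 training-subject IDs are illustrative placeholders, since only the 18/7 proportion is fixed by the protocol.

```python
import numpy as np

# Hypothetical files with precomputed features, labels, and subject IDs.
X = np.load("kth_feats.npy")            # (n_sequences, 12) affine invariants
y = np.load("kth_labels.npy")           # action labels
subjects = np.load("kth_subjects.npy")  # person ID (1..25) per sequence

train_ids = list(range(1, 19))          # 18 training subjects (illustrative IDs)
train_mask = np.isin(subjects, train_ids)
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]
```

The classifier sketched in Section 3.4 can then be fitted on (X_train, y_train) and evaluated on (X_test, y_test).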

The confusion matrix showing the recognition results achieved on the KTH action dataset is given in Table 1, while a comparison of the obtained results with those of other methods available in the literature is shown in Table 3. As the figures in Table 1 show, most actions are correctly classified, and there is a clear distinction between arm actions and leg actions. Most of the confusions occur between the “jogging” and “running” actions and between the “boxing” and “clapping” actions, which is intuitively plausible given the high similarity within each of these pairs of actions. From the comparison given in Table 3, it turns out that our method performs competitively with other state-of-the-art methods. It is pertinent to mention here that the state-of-the-art methods with which we compare our method used the same dataset and the same experimental conditions; therefore, the comparison is fair.

4.2. Experiment 2

This second experiment was conducted on the Weizmann action dataset provided by Blank et al. [32] in 2005, which contains a total of 90 video clips (5098 frames) performed by 9 individuals. Each video clip contains one person performing a single action. There are 10 categories of action in the dataset, namely, walking, running, jumping, jumping in place, bending, jacking, skipping, galloping sideways, one-hand waving, and two-hand waving. All the clips in the dataset are sampled at 25 Hz and last about 2 seconds, with an image frame size of 180 × 144 pixels. In order to provide an unbiased estimate of the generalization ability of the proposed method, we used the leave-one-out cross-validation (LOOCV) technique in the validation process. As the name suggests, this involves using the group of sequences from a single subject as the testing data and the remaining sequences as the training data; this is repeated so that each group of sequences in the dataset is used once for validation. Again, as in the first experiment, SVMs with a Gaussian RBF kernel are trained on the training set, while the recognition performance is evaluated on the test set.
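A subject-wise leave-one-out protocol of this kind can be expressed with scikit-learn's LeaveOneGroupOut splitter, as in the following sketch; the feature matrix, labels, and per-sequence subject IDs are assumed to be precomputed (the file names and hyperparameters are placeholders).

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical files: features, action labels, and subject ID per sequence.
X = np.load("weizmann_feats.npy")
y = np.load("weizmann_labels.npy")
groups = np.load("weizmann_subjects.npy")   # one fold per subject

clf = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")),
)
scores = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
print("per-subject accuracies:", scores, "mean:", scores.mean())
```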

The confusion matrix in Table 2 presents the recognition results obtained by the proposed method, with the correct responses on the main diagonal. A number of observations can be drawn from the figures in the matrix. The majority of actions are correctly classified, and an average recognition rate of 97.8% is achieved with the proposed method. Moreover, there is a clear distinction between arm actions and leg actions. The only confusions occur between the skip and jump actions and between the jump and run actions, which seems intuitively reasonable given the high similarity within each of these pairs of actions. In order to quantify the effectiveness of the method, the obtained results are compared with those obtained previously by other investigators; the outcome of this comparison is presented in Table 3. In the light of this comparison, one can see that the proposed method is competitive with the state-of-the-art methods. It is worthwhile to mention that all the methods with which we compare ours, except the method proposed in [21], used similar experimental setups; thus, the comparison is meaningful and fair. A final remark concerns the real-time performance of our approach: the proposed action recognizer runs at 18 fps on average (on a 2.8 GHz Intel dual-core machine with 4 GB of RAM, running 32-bit Windows 7 Professional).

5. Conclusion and Future Work

In this paper, we have introduced an approach for activity recognition based on affine moment invariants for action representation and SVMs for feature classification. On two benchmark action datasets, the results obtained by the proposed approach compare favorably with those published in the literature. The primary focus of our future work will be empirical validation of the approach on more realistic datasets that present additional technical challenges, such as object articulation, occlusion, and significant background clutter.