Special Issue: Machine Learning in Intelligent Video and Automated Monitoring
Research Article | Open Access
An Efficient Algorithm for Recognition of Human Actions
Recognition of human actions is an emerging need. Various researchers have endeavored to provide a solution to this problem. Some of the current state-of-the-art solutions are either inaccurate or computationally intensive, while others require human intervention. This paper provides a sufficiently accurate yet computationally inexpensive solution to the same problem. Image moments that are translation, rotation, and scale invariant are computed for each frame. A dynamic neural network is used to identify the patterns within the stream of image moments and hence recognize actions. Experiments show that the proposed model performs better than other competitive models.
1. Introduction
Human action recognition is an important field in computer vision. A robust human action recognition system requiring minimal computation has a wide array of potential applications, such as sign language recognition, keyboard or remote control emulation, human-computer interaction, surveillance, and video analysis. Such systems are developed to enable a computer to intelligently recognize a stream of complex human actions captured by a digital camera. This calls for a multitude of efficiently designed algorithms pertaining to pattern recognition and computer vision. Background noise, camera motion, and the position and shape of the object are the major obstacles to solving this problem. This paper presents an efficient and sufficiently accurate algorithm for human action recognition that makes use of image moments, which describe characteristic information about an image. The proposed system aims to recognize human actions regardless of the position, scale, color, size, and phase of the human. The paper describes robust feature extraction and comprehensive classification and training processes. The primary focus is to facilitate the retrieval of videos classified on the basis of the human action they feature. Inherently this requires methods to identify and discover objects of interest by providing comprehensive features after video segmentation, feature extraction, and feature vector organization. These features are designed to be immune to encumbrances such as noise and background view. This calls for methods that consistently yield video descriptors which are repeatable and relevant. An efficient computational paradigm for extracting such descriptors needs to be devised, because only those areas of an image that contain distinguishing features are of concern. A real-time implementation is realized for detection of nominated human actions.
Various researchers have addressed this problem using different methodologies. Tran et al. represent a human action as a combination of the movements of the body parts with which the action correlates. Their method uses a polar pattern of the space to represent the movement of the individual parts of the body. In another article Ali and Shah use kinematic features computed from optical flow for the recognition of human actions in videos. These kinematic features represent the spatiotemporal properties of the video. Their method further performs principal component analysis (PCA) on the kinematic feature volume. Multiple instance learning (MIL) is then used to classify the human action from the reduced data obtained after PCA. Busaryev and Doolittle recognize hand gestures captured from a webcam in real time. Such classification of gestures is applied to control real-world applications. Background subtraction and HSV-based extraction are compared as methods for getting a clean hand image for further analysis. The gesture in each hand image is then determined with Hu moments or a local feature classifier, and each gesture is mapped to a certain keystroke or mouse function. Cao et al. combine multiple features for action detection. They build a novel detection framework based on a GMM representation of spatiotemporal interest points (STIPs). In order to detect moving objects against complicated backgrounds, Zhang et al. improve the Gaussian mixture model by using K-means clustering to initialize it, which gives better motion detection results for surveillance videos. They demonstrate that the proposed silhouette representation, namely, “envelope shape,” solves the viewpoint problem in surveillance videos. Shao et al. present a method that extracts histogram of oriented gradients (HOG) descriptors corresponding to primitive action prototypes.
The output contains only the region of interest (ROI). Using this information the gradient of motion is computed for motion estimation. The gradient vectors are used to partition the motion into periodic cycles. Once a complete cycle of movement is detected, two key frames are selected for encoding the motion. Finally, the action descriptors of the current class are extracted for classification, while the corresponding classifier is trained offline. Ullah et al. implemented the bag of features (BoF) approach for classification of human actions in realistic videos. The main idea is to segment videos into semantically meaningful regions (both spatially and temporally) and then to compute a histogram of local features for each region separately [7, 8].
Certain weaknesses of the recognition algorithm for human actions in video based on kinematic features and multiple instance learning are quite evident. Firstly, the kinematic properties selected are not scale, translation, and rotation invariant, as the same action viewed from different angles induces different optical flow. Secondly, occlusion seriously degrades the performance of the algorithm, especially in cases where a significant part of the body is occluded. Moreover, the training step is the slowest part of the algorithm and makes excessive use of memory due to its iterative nature. The method using the HSV model for segmentation of hands will have problems if another object of the same hue is present in the frame. Other methods using sparse representations for human action recognition cannot handle several actions in a video clip. This is because they do not take into account the spatial and temporal orientation of the extracted features. The method discussed in [9, 10] uses color intensities to segment the action by manually selecting a region. Using this approach a region must be selected every time the scene changes; this undesirably requires human intervention. Furthermore, most of the algorithms work only for a specific illumination; they fail to give results under high or low illumination. The approach used in  is based upon the assumption that each independent observation follows the same distribution. This approach is bound to fail when the observations do not follow the same distribution. Although the approach seems to be scale invariant, it is still not rotation invariant.
The paper is organized into several sections. Section 1 gives a brief introduction to the problem and the current state of the art. Section 2 gives an overview of the proposed system. Section 3 describes the preprocessing and feature extraction process. Section 4 gives a comprehensive description of the training process. Section 5 provides some detailed results from the model, while Section 6 adds some concluding remarks.
2. An Overview of the Proposed System
The system is designed to retrieve semantically similar video clips for a given criterion video clip from a large set of semantically different videos. The video dataset contains features of every proposed action; on a query, the features of the query video are extracted and matched with the stored features in the feature library. Since gestures are sequences of postures (static frames), the system is expected to recognize gestures by identifying their constituent postures one by one. Ultimately a temporal classifier is used to classify the input stream of spatial postures into an action. Figure 1 shows the flow of the initial training process. Firstly, individual frames are extracted from the video input. Secondly, each extracted frame is preprocessed to make it suitable for moment extraction. These moments form a feature vector which is initially used for training of the system.
The system is exhaustively trained using the training process described later. A sufficiently trained system is deemed appropriate for classification of the proposed actions. Figure 2 shows the process used for classification of human actions.
Extracted features from a live video feed are fed into a trained dynamic neural network (DNN) for the purpose of classification. The neural network classifies the action performed in a few successive frames. The dynamic neural network is designed such that its behavior varies temporally based on the video input.
3. Preprocessing and Feature Extraction
A number of preprocessing steps must be performed on the video frames before moment-based features are extracted. Computation of moments requires a monochrome image. The color frame extracted from the video is first binarized using a threshold. The threshold is chosen based on the mean illumination level of the frame, which is computed as the mean of the luminosity values of all pixels in the frame. Once binarized, the image holds only black or white pixels. Dilation and erosion are then performed to remove noise and other impairments. Figures 3 and 4 show the result of this process on a sample frame.
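The preprocessing described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the luminosity weights, the 3×3 structuring element, and the order of the two morphological operations (dilation first, then erosion) are assumptions, and the function names are hypothetical.

```python
import numpy as np

def binarize(frame_rgb):
    """Threshold an RGB frame at its mean luminosity; returns a boolean mask."""
    # Luminosity as a weighted sum of R, G, B (the weights are an assumption).
    lum = frame_rgb[..., :3].astype(float) @ np.array([0.299, 0.587, 0.114])
    return lum > lum.mean()

def _neighbourhood(mask, pad_value):
    """Stack the 3x3 neighbourhood of every pixel along a new leading axis."""
    padded = np.pad(mask, 1, constant_values=pad_value)
    h, w = mask.shape
    return np.stack([padded[r:r + h, c:c + w] for r in range(3) for c in range(3)])

def dilate(mask):
    """A pixel becomes white if any pixel in its 3x3 neighbourhood is white."""
    return _neighbourhood(mask, False).any(axis=0)

def erode(mask):
    """A pixel stays white only if its whole 3x3 neighbourhood is white."""
    return _neighbourhood(mask, True).all(axis=0)

def clean(mask):
    """Dilation followed by erosion (assumed order), which fills small holes."""
    return erode(dilate(mask))
```

With this ordering, a one-pixel hole inside a white region is filled while the outline of the region is preserved; applying erosion first would instead remove isolated white specks.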
Before any intricate processing is performed on the data set, the background is removed from each frame. Two alternative approaches are adopted here. In the first approach the first few frames are captured without any foreground action, so that they contain only the background. Any frame from this initial footage is used as a representative. This frame is subtracted from each frame containing foreground to obtain the filtered foreground. In the other approach each pair of successive frames is XORed. The resultant frame represents the change in action during the period of the latter frame. The difference frame in this case also excludes the background.
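Both background removal strategies reduce to simple elementwise operations on binarized frames. The sketch below assumes the frames are boolean NumPy arrays of equal shape, and it interprets "subtraction" of binary frames as keeping pixels that are white in the frame but not in the background; both the interpretation and the function names are assumptions.

```python
import numpy as np

def subtract_background(frame, background):
    """Keep pixels that are set in the frame but not in the background frame."""
    return np.logical_and(frame, np.logical_not(background))

def frame_difference(previous, current):
    """XOR of two successive frames: only pixels that changed remain set."""
    return np.logical_xor(previous, current)
```

The first function needs a clean background frame recorded in advance, while the second needs no reference frame but responds only to motion between the two frames.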
3.1. Moments Extraction
Moments are scalar quantities which are used to characterize a shape and its features. They are computed from the shape boundary and its entire region. The concept of moments in images is quite similar to the concept of moments in physics. One major difference between the two is that image moments are inherently two-dimensional in nature. The solution to the proposed problem is sought with the help of various moments, namely raw, central, scale invariant, and rotation invariant moments, along with certain physical properties of the image such as the centroid and eccentricity. Invariant moments are those moments which are unaffected by certain deformations of the shape and are most suited for comparison between two images. The scale, location, and rotation invariant moments are used to extract features regardless of size, position, and rotation, respectively.
3.2. Raw Moments
Raw moments are calculated about the origin of the image. Let $f(x, y)$ be a function that defines an image, where $(x, y)$ are arbitrary coordinates of the image. For a two-dimensional continuous signal the raw moment of order $(p + q)$ is given as
$$M_{pq} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^{p} y^{q} f(x, y)\, dx\, dy.$$
For a digital image the integrals become sums over the pixels:
$$M_{pq} = \sum_{i} \sum_{j} x_i^{p}\, y_j^{q}\, f(x_i, y_j),$$
where $x_i$ is the $i$th pixel along the $x$-axis, $y_j$ is the $j$th pixel along the $y$-axis, and $p$, $q$ are the indices of the moment. These moments are computed throughout the span of the image. The raw moments provide information about properties like the area and size of the image; for example, the moment $M_{00}$ gives the area of the object.
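As a small illustration, the discrete raw moment is a sum of `x**p * y**q` over the white pixels of a binary frame. The sketch assumes a 2D NumPy array with rows as the y-axis and columns as the x-axis; the function name is hypothetical.

```python
import numpy as np

def raw_moment(image, p, q):
    """Raw moment M_pq of a 2D image: sum of x^p * y^q * f(x, y) over all pixels."""
    # ys holds the row index of every pixel, xs the column index.
    ys, xs = np.mgrid[:image.shape[0], :image.shape[1]]
    return float(np.sum((xs ** p) * (ys ** q) * image))
```

For a binary image, `raw_moment(image, 0, 0)` counts the white pixels, which is the area of the object.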
3.3. Central Moments
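The moments which are invariant to translation of objects in an image are called central moments, as they are computed about the centroid rather than the origin. Central moments are calculated from the raw moments: the first-order raw moments, that is, $M_{10}$ and $M_{01}$, are used to locate the centroid of the image.
Let $f(x, y)$ be a digital image; then shifting the coordinates in the raw moment equation by the center of gravity ($\bar{x}$ and $\bar{y}$) of the object we get
$$\mu_{pq} = \sum_{x} \sum_{y} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y).$$
The center of mass is the point of intersection of the lines $x = \bar{x}$ and $y = \bar{y}$, parallel to the $y$- and $x$-axis respectively, about which the first order moments are zero. The coordinates of the center of gravity are the components of the centroid, given as follows:
$$\bar{x} = \frac{M_{10}}{M_{00}}, \qquad \bar{y} = \frac{M_{01}}{M_{00}}.$$
Moments of order up to three simplify to the following:
$$\mu_{00} = M_{00}, \qquad \mu_{10} = 0, \qquad \mu_{01} = 0,$$
$$\mu_{11} = M_{11} - \bar{x} M_{01} = M_{11} - \bar{y} M_{10},$$
$$\mu_{20} = M_{20} - \bar{x} M_{10}, \qquad \mu_{02} = M_{02} - \bar{y} M_{01},$$
$$\mu_{21} = M_{21} - 2 \bar{x} M_{11} - \bar{y} M_{20} + 2 \bar{x}^{2} M_{01},$$
$$\mu_{12} = M_{12} - 2 \bar{y} M_{11} - \bar{x} M_{02} + 2 \bar{y}^{2} M_{10},$$
$$\mu_{30} = M_{30} - 3 \bar{x} M_{20} + 2 \bar{x}^{2} M_{10},$$
$$\mu_{03} = M_{03} - 3 \bar{y} M_{02} + 2 \bar{y}^{2} M_{01}.$$
The generalized form of central moments is
$$\mu_{pq} = \sum_{m=0}^{p} \sum_{n=0}^{q} \binom{p}{m} \binom{q}{n} (-\bar{x})^{p-m} (-\bar{y})^{q-n} M_{mn}.$$
The main advantage of central moments is their invariance to translations of the object. Therefore they are well suited to describe the form or shape of the object, while the centroid carries information about the location of the object.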
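The translation invariance can be checked numerically. The following sketch computes the centroid and a central moment directly from their definitions, assuming a 2D NumPy array with rows as y and columns as x; the function names are hypothetical.

```python
import numpy as np

def centroid(image):
    """Centroid (x_bar, y_bar) = (M10 / M00, M01 / M00)."""
    ys, xs = np.mgrid[:image.shape[0], :image.shape[1]]
    m00 = image.sum()
    return (xs * image).sum() / m00, (ys * image).sum() / m00

def central_moment(image, p, q):
    """Central moment mu_pq: the raw moment taken about the centroid."""
    ys, xs = np.mgrid[:image.shape[0], :image.shape[1]]
    x_bar, y_bar = centroid(image)
    return float(np.sum((xs - x_bar) ** p * (ys - y_bar) ** q * image))
```

Moving the same shape to a different position in the frame changes the centroid but leaves every central moment unchanged.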
3.4. Scale Invariant Moments
The raw moments and the central moments depend on the size of the object. This creates a problem when the same object is compared in two images captured from different distances. To deal with this encumbrance scale invariant moments are calculated. The moments $\eta_{pq}$ are invariant to changes in scale and are obtained by dividing the central moment by a scaled $(00)$th moment, as given in the following:
$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,1 + (p + q)/2}}.$$
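A brief sketch of the scale invariant moment, computed as the central moment divided by $\mu_{00}^{1 + (p+q)/2}$. It assumes a 2D NumPy array as input, and the function name is hypothetical. On a pixel grid the invariance holds only approximately, because of discretization.

```python
import numpy as np

def scale_invariant_moment(image, p, q):
    """eta_pq = mu_pq / mu_00 ** (1 + (p + q) / 2)."""
    image = np.asarray(image, dtype=float)
    ys, xs = np.mgrid[:image.shape[0], :image.shape[1]]
    m00 = image.sum()  # mu_00 equals M_00
    x_bar = (xs * image).sum() / m00
    y_bar = (ys * image).sum() / m00
    mu_pq = np.sum((xs - x_bar) ** p * (ys - y_bar) ** q * image)
    return float(mu_pq / m00 ** (1 + (p + q) / 2))
```

Doubling the size of a shape multiplies its second-order central moments by roughly sixteen, while the normalized values stay nearly the same.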
3.5. Rotational Invariant Moments
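Rotation invariant moments are those moments which are invariant to changes in scale and also in rotation. Most frequently used are the Hu set of invariant moments:
$$I_1 = \eta_{20} + \eta_{02},$$
$$I_2 = (\eta_{20} - \eta_{02})^{2} + 4 \eta_{11}^{2},$$
$$I_3 = (\eta_{30} - 3\eta_{12})^{2} + (3\eta_{21} - \eta_{03})^{2},$$
$$I_4 = (\eta_{30} + \eta_{12})^{2} + (\eta_{21} + \eta_{03})^{2},$$
$$I_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^{2} - 3(\eta_{21} + \eta_{03})^{2}\right] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^{2} - (\eta_{21} + \eta_{03})^{2}\right],$$
$$I_6 = (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^{2} - (\eta_{21} + \eta_{03})^{2}\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}),$$
$$I_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^{2} - 3(\eta_{21} + \eta_{03})^{2}\right] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^{2} - (\eta_{21} + \eta_{03})^{2}\right].$$
All the moments discussed in this section are computed for each frame. The collection of the moments is used as a feature vector. This feature vector provides characteristic information about the contents of the frame in numerical form. The variation of patterns formed by periodic frames in a video defines the action being performed. Further, a framework is presented which is capable of recognizing the hidden patterns within the stream of feature vectors for each defined human action [14–16].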
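As a rough illustration, the seven Hu invariants can be computed from the normalized central moments of a frame. The sketch below assumes a 2D NumPy array and uses the standard Hu formulas; the function names are hypothetical and this is not the authors' code.

```python
import numpy as np

def normalized_central_moments(image):
    """Return {(p, q): eta_pq} for all orders with 2 <= p + q <= 3."""
    image = np.asarray(image, dtype=float)
    ys, xs = np.mgrid[:image.shape[0], :image.shape[1]]
    m00 = image.sum()
    dx = xs - (xs * image).sum() / m00
    dy = ys - (ys * image).sum() / m00
    return {
        (p, q): (dx ** p * dy ** q * image).sum() / m00 ** (1 + (p + q) / 2)
        for p in range(4) for q in range(4) if 2 <= p + q <= 3
    }

def hu_moments(image):
    """The seven Hu invariants I1..I7 as a NumPy array."""
    n = normalized_central_moments(image)
    n20, n02, n11 = n[2, 0], n[0, 2], n[1, 1]
    a = n[3, 0] - 3 * n[1, 2]   # eta30 - 3 eta12
    b = n[3, 0] + n[1, 2]       # eta30 + eta12
    c = n[2, 1] + n[0, 3]       # eta21 + eta03
    d = 3 * n[2, 1] - n[0, 3]   # 3 eta21 - eta03
    return np.array([
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        a ** 2 + d ** 2,
        b ** 2 + c ** 2,
        a * b * (b ** 2 - 3 * c ** 2) + d * c * (3 * b ** 2 - c ** 2),
        (n20 - n02) * (b ** 2 - c ** 2) + 4 * n11 * b * c,
        d * b * (b ** 2 - 3 * c ** 2) - a * c * (3 * b ** 2 - c ** 2),
    ])
```

Rotating a frame by a quarter turn with `np.rot90` or shifting the shape within the frame leaves the returned values unchanged up to floating-point error, which makes them convenient entries for the per-frame feature vector.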
4. Training the Network
A drawback of supervised training is that the training data needs to be labeled. Initially each frame in the training video is assigned a class number. A specific number is assigned to each class, so every frame belonging to a class carries that class number. A target matrix is organized such that each column represents the label of a frame within the training data. An input matrix is correspondingly organized in which each column contains the extracted moments of the frame. Further, a neural network is designed such that each neuron in the input layer is clamped to one element of the obtained feature vector. The number of neurons in the hidden layer is variable and is changed to fine-tune the results, while the output layer has as many neurons as there are designated classes. Moreover, the network is of recurrent nature; that is, the output at the output layer is recurrently clamped with the input as shown in Figure 5. Initially all the edges into and out of the hidden layer are assigned random weights. The back propagation algorithm is used to adjust these weights and converge the output. This algorithm makes use of the sigmoid function for training, given as
$$f(x) = \frac{1}{1 + e^{-x}}.$$
The derivative of this function is given as
$$f'(x) = f(x)\left(1 - f(x)\right).$$
The feature vector for each frame is fed into the input layer and the output is computed. The difference between the actual and labeled output determines the error. The back propagation technique back-tracks this error and readjusts the weights so that the error is reduced. The weights are adjusted in a backward direction: in the proposed network the weights of the hidden layer are adjusted first, and then the same is done for the input layer. Several iterations are performed for each input until convergence is achieved and no appreciable change in the weights is perceived.
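The organization of the training data and the activation function can be sketched as follows. The sketch assumes zero-based class numbers and a one-hot target column per frame (one output neuron per class); the encoding and the function names are assumptions rather than details given in the text.

```python
import numpy as np

def build_training_matrices(feature_vectors, class_numbers, n_classes):
    """Input matrix (features x frames) and one-hot target matrix (classes x frames)."""
    inputs = np.array(feature_vectors, dtype=float).T
    labels = np.asarray(class_numbers)
    targets = np.zeros((n_classes, labels.size))
    targets[labels, np.arange(labels.size)] = 1.0  # one column per frame
    return inputs, targets

def sigmoid(x):
    """Logistic activation 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of the sigmoid, expressed through the sigmoid itself."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

Expressing the derivative through the function value is what makes back propagation cheap here: the quantity `y * (1 - y)` is available from the forward pass without further evaluation of the exponential.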
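Let the weight of an edge between an arbitrary neuron $i$ in the input layer and an arbitrary neuron $j$ in the hidden layer be given as $w_{ij}$, while the weight of an edge between an arbitrary neuron $j$ in the hidden layer and an arbitrary neuron $k$ in the output layer is given as $w_{jk}$. The output of each neuron in the hidden layer is computed from the inputs $x_i$ as
$$y_j = \mathrm{sigmoid}\left(\sum_{i=1}^{n} x_i w_{ij} - \theta_j\right),$$
where $n$ represents the number of input layer neurons and $\theta_j$ the threshold used by the $j$th neuron in the hidden layer. The outputs of the hidden layer are in turn used to compute the outputs at the output layer as follows:
$$y_k = \mathrm{sigmoid}\left(\sum_{j=1}^{m} y_j w_{jk} - \theta_k\right),$$
while $\theta_k$ is the threshold of the $k$th neuron at the output layer, $y_k$ is the neuron output, and $m$ is the number of neurons in the hidden layer.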
The obtained feature vector for a single video frame is clamped to the neural network in order to produce an output $y_k$. Here the difference between the desired and actual output is computed as the error, given as
$$e_k = y_{d,k} - y_k,$$
while $y_{d,k}$ is the desired output.
Further, the error gradient is used to determine the incremental iterative change in the weights so that the actual output approaches the expected output. The error gradient is defined as the product of the error and the rate of change of the actual output. Analytically it is given as
$$\delta_k = \frac{\partial y_k}{\partial X_k}\, e_k,$$
where $X_k$ is the net weighted input of output neuron $k$, $y_k$ is its actual output, and $e_k$ is its error. Using the partial derivative of $y_k$ and putting it in the above equation, the following equation is formed:
$$\delta_k = y_k (1 - y_k)\, e_k.$$
The weights of the edges between the input and hidden layer also need to be adjusted. For this purpose the error gradient for the hidden layer should also be calculated. In the back propagation technique the errors are back-tracked. The error gradient at the output layer is primarily used to calculate the error gradient at the hidden layer. Here, the following equation is used to calculate it:
$$\delta_j = y_j (1 - y_j) \sum_{k} \delta_k w_{jk},$$
where $y_j$ is the output of hidden neuron $j$ and $w_{jk}$ is the weight of the edge from hidden neuron $j$ to output neuron $k$. Using these error gradients the renewed weights for the neurons at each layer are computed. The following equations are used:
$$w_{jk}(t + 1) = w_{jk}(t) + \alpha\, y_j\, \delta_k,$$
$$w_{ij}(t + 1) = w_{ij}(t) + \alpha\, x_i\, \delta_j,$$
where $x_i$ is the $i$th input, $t$ is the iteration, and $\alpha$ is the learning rate. Generally it is a small positive value less than 1 and is adjustable according to the learning behavior of the network. Similarly, the thresholds used in computing the outputs should also be recalculated for the next iteration. The following equations are used to recalculate the thresholds:
$$\theta_k(t + 1) = \theta_k(t) - \alpha\, \delta_k,$$
$$\theta_j(t + 1) = \theta_j(t) - \alpha\, \delta_j.$$
These two equations give the updated threshold for an arbitrary neuron in the output and hidden layer, respectively. This method of gradient descent is effective for a wide range of problems. It has the capability to minimize the error and optimize the network to provide accurate results. Although the training process is iterative, it ultimately needs to terminate. This termination condition is indicated by the convergence of results. The result is said to have converged when no appreciable change in the weights is possible. This termination condition is determined using the mean square error, given as
$$E = \frac{1}{N} \sum_{p=1}^{N} \sum_{k} e_k^{2}(p),$$
where $N$ is the number of training frames and $e_k(p)$ is the error at output neuron $k$ for frame $p$. In the current problem a learning rate of was used.
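The update rules can be collected into a compact training loop. The following is a sketch under stated assumptions, not the authors' implementation: it trains a plain feedforward network with one hidden layer (the recurrent feedback is omitted for brevity), subtracts the thresholds inside the sigmoid, and uses a hypothetical learning rate, weight range, epoch limit, and stopping tolerance.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(inputs, targets, n_hidden, alpha=0.5, epochs=2000, tol=1e-3, seed=0):
    """Online back propagation; inputs is (features x frames), targets is (classes x frames)."""
    rng = np.random.default_rng(seed)
    n_in, n_out = inputs.shape[0], targets.shape[0]
    # Random initial weights and thresholds (the range is an assumption).
    w_ij = rng.uniform(-0.5, 0.5, (n_in, n_hidden))
    w_jk = rng.uniform(-0.5, 0.5, (n_hidden, n_out))
    theta_j = rng.uniform(-0.5, 0.5, n_hidden)
    theta_k = rng.uniform(-0.5, 0.5, n_out)
    history = []
    for _ in range(epochs):
        squared_error = 0.0
        for x, y_d in zip(inputs.T, targets.T):
            y_j = sigmoid(x @ w_ij - theta_j)                    # hidden layer outputs
            y_k = sigmoid(y_j @ w_jk - theta_k)                  # output layer outputs
            e_k = y_d - y_k                                      # error
            delta_k = y_k * (1.0 - y_k) * e_k                    # output error gradient
            delta_j = y_j * (1.0 - y_j) * (w_jk @ delta_k)       # hidden error gradient
            w_jk += alpha * np.outer(y_j, delta_k)
            w_ij += alpha * np.outer(x, delta_j)
            theta_k -= alpha * delta_k
            theta_j -= alpha * delta_j
            squared_error += float(np.sum(e_k ** 2))
        history.append(squared_error / inputs.shape[1])          # mean square error
        if history[-1] < tol:
            break
    return (w_ij, w_jk, theta_j, theta_k), history
```

The hidden error gradient is computed before `w_jk` is modified, so both layers are updated from the same forward pass. The returned `history` holds the mean square error per epoch and can be inspected to judge convergence.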
The output of a recurrent neural network depends not only on the current input but also on the previous output. This recurrent nature makes such networks useful for problems which require a continuous input of dynamic data that changes over time. Identification of an action does not necessarily depend on a single frame; rather, previous and subsequent frames may also tell a story. Hence the use of a recurrent network caters for the need for previous temporal information.
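One way to picture this feedback is a forward pass in which the previous output vector is appended to the current feature vector before it enters the hidden layer. The sketch below is a guess at the wiring, not a description of the network in Figure 5: the concatenation scheme, the zero initial state, and all names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classify_sequence(frames, w_ij, theta_j, w_jk, theta_k):
    """Run a stream of feature vectors through a network with output feedback.

    w_ij has shape (n_features + n_outputs, n_hidden) because the previous
    output is concatenated with the current feature vector.
    """
    previous = np.zeros(theta_k.size)          # no history before the first frame
    outputs = []
    for x in frames:
        z = np.concatenate([x, previous])      # current input plus previous output
        y_j = sigmoid(z @ w_ij - theta_j)
        previous = sigmoid(y_j @ w_jk - theta_k)
        outputs.append(previous)
    return np.array(outputs)
```

With nonzero feedback weights, the same frame can produce different outputs depending on what preceded it, which is the temporal behavior the paragraph describes.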
5. Results and Discussion
A large database of videos was collected containing hundreds of videos of varied length. Each video contained actions like
(i) walking,
(ii) clapping,
(iii) hugging,
(iv) single hand waving,
(v) double hand waving,
(vi) hand shaking.
Figure 6 shows some of the sample actions. Several videos containing different actions were taken under varied conditions in terms of illumination, location, and background. Frames are extracted from these videos one by one. Each frame is first labeled manually in accordance with its semantic content. Each stream of frames belonging to a specified class is separated and kept in a folder, maintaining its sequence. Hence several samples of each action are segmented from the videos manually. Each sample is a stream of frames belonging to a specific action. In the next step the background or the effect of the background is removed from the frame. Two different strategies are followed for this purpose. With the first method, the background is removed by first taking a blank frame which contains only the background and then subtracting this frame from the one containing a foreground. As a result the background is eliminated. The other method used for this purpose takes the difference of two successive frames. The resultant frame contains just the change that occurred due to motion dynamics. Once the effect of the background has been countered, a corresponding feature vector is formed for every resultant frame. The feature vector of a frame contains the raw, central, scale invariant, and rotation invariant moments of the image besides its centroid and eccentricity. Tables 1, 2, 3, 4, and 5 show the quantified values of these features. The computed vector for each frame is fed into the recurrent neural network, iteratively training the network as described previously. The training stops when the mean square error is minimized. Not all of the database is used for training.
One-fourth of the database samples are not used for training; rather they are reserved for testing. Once the model has been sufficiently trained, it is tested. The reserved samples are similarly transformed into feature vectors and fed into the trained model. The accuracy of the model is based on its ability to correctly identify these unseen samples. Figure 7 represents the confusion matrix, which shows that the overall accuracy of the system is 80.8%. It is also noticed that the system is better able to recognize medial frames in an action than initial or terminal ones. The accuracy of the system increases to 95% if only the recognition of the medial frames is considered.
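The evaluation protocol can be sketched as a hold-out split followed by a confusion matrix. The sketch is illustrative only: the random shuffling, the seed, and the function names are assumptions, and the labels in the usage example are made up rather than taken from the reported results.

```python
import numpy as np

def split_samples(n_samples, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with one fourth held out by default."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (np.asarray(true_labels), np.asarray(predicted_labels)), 1)
    return matrix

def overall_accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    return np.trace(matrix) / matrix.sum()

# Usage with made-up labels for three classes.
example = confusion_matrix([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], 3)
print(overall_accuracy(example))
```

Restricting `true_labels` and `predicted_labels` to the medial frames of each action before building the matrix would give the second accuracy figure described in the text.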