Abstract

Human complex action recognition is an important research area within action recognition. Among the various obstacles to human complex action recognition, one of the most challenging is dealing with self-occlusion, where one body part occludes another. This paper presents a new method for human complex action recognition based on optical flow and the correlated topic model (CTM). First, a Markov random field (MRF) is used to represent the occlusion relationship between human body parts in terms of an occlusion state variable. Second, structure from motion (SFM) is used to reconstruct the missing data of point trajectories. Then, key frames are extracted from motion features computed by optical flow, and the width-to-height ratios are extracted from the human silhouette. Finally, the correlated topic model (CTM) is used to classify actions. Experiments were performed on the KTH, Weizmann, and UIUC action datasets to test and evaluate the proposed method. The comparative experimental results show that the proposed method is more effective than the compared methods.

1. Introduction

Automatic recognition of human actions from video is a challenging problem that has attracted the attention of researchers in recent decades. It has applications in many areas such as entertainment, virtual reality, motion capture, sports training [1], medical biomechanical analysis, ergonomic analysis, human-computer interaction, surveillance and security, environmental control and monitoring, and patient monitoring systems. Human complex action recognition is an important research field within action recognition. Among the various obstacles to human complex action recognition, one of the most challenging is dealing with “self-occlusion”, where one body part occludes another. Adaptive self-occlusion behavior recognition has traditionally been tackled by applying statistical prediction and inference methods [2]. Unfortunately, basic numerical methods have proved to be insufficient when dealing with complex occlusion scenarios that present interactions between objects (e.g., occlusions, unions, or separations), modifications of the objects (e.g., deformations), and changes in the scene (e.g., illumination). These events are hard to manage and frequently result in tracking errors, such as track discontinuity and inconsistent track labeling.

Our models are motivated by the recent success of “bag-of-words” representations for object recognition problems in computer vision. The common paradigm of these approaches consists of extracting local features from a collection of images, constructing a codebook of visual words by vector quantization, and building a probabilistic model to represent the collection of visual words. While these models of an object as a collection of local patches are certainly not “correct” ones (e.g., they only model a few parts of objects and often ignore many structures), they have been demonstrated to be quite effective in object recognition tasks [3–5]. For recognition approaches, some structure is lost by moving to this representation. However, this representation is much simpler than one that explicitly models temporal structure. There has been previous work (e.g., Yamato et al. [6], Bobick and Wilson [7], and Xiang and Gong [8]) that tries to model the full dynamics of videos using sophisticated probabilistic models (e.g., hidden Markov models and dynamic Bayesian networks). Li and Perona [9] use a variant of LDA for natural scene categorization. Sivic et al. [10] perform unsupervised learning of object categories using variants of the pLSA model. In these models, the “words” correspond to local patches extracted by interest point operators, and the “topics” correspond to the different object categories. Fergus et al. [11] extend pLSA to incorporate spatial information in a translation- and scale-invariant manner and apply it to learn object categories from Google’s image search. Wang et al. [12] design a simultaneous classification and annotation framework that extends LDA and allows image features and text words to share the same dimensional topic space. Putthividhya et al. [13] propose a more general and flexible annotation model that allows different topic spaces for image features and text words. Bissacco et al. [14] use LDA for human detection and pose classification. The “visual words” in their model are vector quantizations of histograms of oriented gradients in the training images. Niebles et al. [15] demonstrate impressive results on unsupervised learning of human action categories using pLSA and LDA models for human action recognition. Wong et al. [16] adopt pLSA models to capture both semantic (content of parts) and structural (connection between parts) information for recognizing actions and inferring the locations of certain actions.

Optical flow-based action detection methods are well known [17–19]. Efros et al. [20] recognize human actions at a distance in low resolution by introducing a motion descriptor based on optical flow measurements. Ahmad and Lee [21] propose a view-independent recognition method using the Cartesian components of the optical flow velocity together with human body shape feature vector information. Optical flow is usually used together with other features, because it is noisy and inconsistent between frames [22]. Optical flow histograms have also been used to analyze the motion in individual behavior videos. The time series of histograms of optical flow has been modeled as a nonlinear dynamical system using Binet-Cauchy kernels in [23]. However, this approach cannot deal with large motion, for example, rapid movement across frames.

In order to overcome the shortcomings mentioned above, we propose an adaptive self-occlusion action recognition method that not only estimates the occlusion states of body parts but also recognizes the occluded behavior. First, a Markov random field is used to represent the occlusion state of human body parts. Second, structure from motion (SFM) is used to reconstruct the missing data of point trajectories. Then, key frames are extracted from motion features computed by optical flow, and the width-to-height ratios are extracted from the human silhouette. Finally, the correlated topic model (CTM) is used for action recognition. Experiments were performed on the KTH, Weizmann, and UIUC action datasets to test and evaluate the proposed method. The experimental results show that the proposed method is effective for action recognition.

The remainder of this paper is organized as follows. Section 2 presents the adaptive occlusion state estimation by Markov random field (MRF). In Section 3, we reconstruct the missing data of point trajectories by structure from motion (SFM). Section 4 explains the feature representation. Section 5 explains the action model and the design of the classifier. Section 6 presents the results and analysis of the proposed approach. Finally, we conclude the paper in Section 7.

2. The Adaptive Occlusion State Estimation

The human body is divided into 15 key points, namely, 15 joint points representing the human body structure (torso, pelvis, left upper leg, left lower leg, left foot, right upper leg, right lower leg, right foot, left upper arm, left lower arm, left hand, right upper arm, right lower arm, right hand, and head) [24], which represent human body behavior. In order to model the observations, the spatial relations, and the motion relationships, we use a Markov random field (MRF), which can determine the occlusion positions of the body joints.

In this paper, we use a state variable in the Markov random field (MRF) to represent the self-occlusion relationship between body parts. The MRF is a graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The graph nodes represent the states of the human body parts, and the graph edges model the relationships between the parts [25]. The probability distribution over this graph is specified by the set of potentials defined over the set of edges. The MRF structural parameters [24, 25] are defined as follows:

$x_i$: the coordinates of the $i$th joint point, for the 15 extracted key points of the body; $v_i$: the visibility of the $i$th joint, which is used to determine the occlusion relation between nodes (when occlusion occurs, the corresponding trajectories intersect); $o_{ij}$: the occlusion relation among the 15 body joints, with one state indicating that the $i$th and $j$th joints are not occluded, one indicating that the $i$th joint occludes the $j$th, and one indicating that the $j$th joint occludes the $i$th; $O$: the set of occluded joint nodes.
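To make this bookkeeping concrete, the following Python sketch lists the 15 joints, the kinematic edges of a graph $G = (V, E)$ over them, and a pairwise occlusion-state table. It is an illustration only; the edge list and the state encoding are our own simplification, not the exact MRF of [24, 25].

```python
# Illustration of the 15-joint decomposition and the pairwise occlusion
# states described above.  The edge list and the state encoding are a
# simplification, not the exact MRF of [24, 25].
JOINTS = [
    "head", "torso", "pelvis",
    "left_upper_arm", "left_lower_arm", "left_hand",
    "right_upper_arm", "right_lower_arm", "right_hand",
    "left_upper_leg", "left_lower_leg", "left_foot",
    "right_upper_leg", "right_lower_leg", "right_foot",
]

# Kinematic edges of the MRF graph G = (V, E): adjacent body parts.
EDGES = [
    ("head", "torso"), ("torso", "pelvis"),
    ("torso", "left_upper_arm"), ("left_upper_arm", "left_lower_arm"),
    ("left_lower_arm", "left_hand"),
    ("torso", "right_upper_arm"), ("right_upper_arm", "right_lower_arm"),
    ("right_lower_arm", "right_hand"),
    ("pelvis", "left_upper_leg"), ("left_upper_leg", "left_lower_leg"),
    ("left_lower_leg", "left_foot"),
    ("pelvis", "right_upper_leg"), ("right_upper_leg", "right_lower_leg"),
    ("right_lower_leg", "right_foot"),
]

# Pairwise occlusion state o_ij: "none", "i_occludes_j", or "j_occludes_i".
occlusion_state = {(i, j): "none"
                   for i in JOINTS for j in JOINTS if i != j}
```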

We apply the MRF model presented in [24, 25] to estimate the potential of the kinematic relationship and, similar to [26], calculate three potential functions for video activity analysis. The potential of the kinematic relationship and the three potential functions are defined as follows. (1) Kinematic relationship. This potential encodes the positions of two adjacent joints and the angles between joints; it is a function of the Euclidean distance between two adjacent joints, modeled as a normal distribution with a given mean and standard deviation. (2) The potential functions. The observation potential function combines the potential of the color and the potential of the edge, both computed from the input image.

The potential of the color consists of two terms: the first term is the probability of occurrence of the color in the visible area, and the second term accounts for the occluded area. These terms are defined through the motion state of the $i$th body joint in the visible area and its motion state in the occluded area, respectively.

The visible term is formulated from the distributions of the color of a pixel given the foreground and the background. The occluded term is computed from the occlusion area, which is determined by the overlapping region of the joints involved, summed over all occluded nodes and constrained by the lower and upper bounds of the motion range between adjacent joints defined by kinesiology.

Finally, the potential of the temporal relationship is computed from the dynamics of the joint at the previous time step and a diagonal covariance matrix, so that it behaves like a Gaussian distribution over time.

In this paper, the posterior distribution of the model, conditioned on all input images up to the current time step, the current joint structure, and the occlusion state variable, is given by the product of the potentials defined above, up to a normalization constant.

In other words, substituting these quantities into (4) yields the positions of the occluded body joints, where each term denotes the joint location at the corresponding time step.

The occlusion relation among joints can then be obtained from formula (2), where each term denotes the joint position at the corresponding time step.

Therefore, the occluded positions can be calculated by the MRF over the entire duration of the motion.

3. Reconstruct the Missing Data of Point Trajectories

We use the structure from motion (SFM) model to reconstruct the missing trajectories of the occluded joints [27]. Consider a set of $P$ point trajectories extracted from the parts of the human body that move rigidly over $F$ frames. By stacking the image trajectories in a single matrix $W$ of size $2F \times P$, it is possible to express the global motion, which represents the complete trajectory. We define it as $W = M S$, where $M$ is the human body motion matrix and $S$ is the human body contour matrix in homogeneous coordinates. Each frame-wise element of $M$, for $f = 1, \dots, F$, contains a $2 \times 3$ orthographic camera matrix $R_f$ that has to satisfy the metric constraints of the model (i.e., $R_f R_f^{\top} = I_2$). The 2-vector $t_f$ represents the 2D translation of the rigid object (in this paper, we consider the human body as a rigid object). We introduce the registered measurement matrix $\widehat{W}$ such that $\widehat{W} = W - T \mathbf{1}^{\top}$, where $\mathbf{1}$ is a vector of ones and $T$ stacks the translations $t_f$.

In the case of missing data due to occlusions, we define the binary mask matrix $D$ of size $2F \times P$ such that 1 represents a known entry and 0 denotes a missing one. In order to solve for the components $M$ and $S$, and thus the SFM problem, the equivalent optimization problem [28] can be defined as $\min_{M, S} \| D \odot (\widehat{W} - M S) \|_F^2$ subject to the metric constraints on $M$, where $\odot$ denotes the Hadamard (element-wise) product.

Therefore, we can recover the complete matrix $M S$ and fill in the missing points of the trajectories. Figure 1 shows the reconstructed trajectory of the missing points.
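As an illustration of this step (not the exact algorithm of [28]), the following Python sketch completes the masked measurement matrix with a simple alternating least-squares low-rank factorization; the function name, the rank, and the iteration count are illustrative choices.

```python
import numpy as np

def complete_trajectories(W, D, rank=4, n_iters=100, seed=0):
    """Fill in missing trajectory entries by masked low-rank factorization.

    W : (2F, P) measurement matrix (missing entries may contain anything).
    D : (2F, P) binary mask, 1 = observed entry, 0 = missing entry.
    rank : assumed rank of the rigid-motion model.
    """
    rng = np.random.default_rng(seed)
    W = np.where(D > 0, W, 0.0)
    M = rng.standard_normal((W.shape[0], rank))   # motion-like factor
    S = rng.standard_normal((rank, W.shape[1]))   # shape-like factor
    for _ in range(n_iters):
        # Update each column of S from the rows observed for that point.
        for p in range(W.shape[1]):
            idx = D[:, p] > 0
            if idx.any():
                S[:, p], *_ = np.linalg.lstsq(M[idx], W[idx, p], rcond=None)
        # Update each row of M from the points observed in that row.
        for r in range(W.shape[0]):
            idx = D[r, :] > 0
            if idx.any():
                M[r, :], *_ = np.linalg.lstsq(S[:, idx].T, W[r, idx], rcond=None)
    return M @ S   # completed measurement matrix

# Usage: W_full = complete_trajectories(W, D); recovered = W_full[D == 0]
```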

4. Feature Representation

4.1. Motion Feature Extraction

The human action can be recognized in terms of the hierarchical area model and relative velocity. In this paper, we use optical flow to detect the relative direction and magnitude of environmental motion observed with respect to an observer, and also to describe the movement of objects in the current image with respect to the previous image. The optical flow equation [22, 29] can be assumed to hold for all pixels within a window centered at $p$, so the local image flow (velocity) vector $(V_x, V_y)$ must satisfy $I_x(q_i) V_x + I_y(q_i) V_y = -I_t(q_i)$ for $i = 1, \dots, n$, where $q_1, \dots, q_n$ are the pixels inside the window and $I_x(q_i)$, $I_y(q_i)$, and $I_t(q_i)$ are the partial derivatives of the image with respect to position $x$, $y$, and time $t$, evaluated at the point $q_i$ and the current time. These equations can be written in matrix form $A v = b$, where $A$ stacks the spatial derivatives $[I_x(q_i) \; I_y(q_i)]$ row by row, $v = (V_x, V_y)^{\top}$, and $b$ stacks the negated temporal derivatives $-I_t(q_i)$.
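For concreteness, a minimal Python sketch of the least-squares solution of this windowed system is given below; estimating the derivatives $I_x$, $I_y$, $I_t$ (e.g., by finite differences) is assumed to be done beforehand, and the function name is illustrative.

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It):
    """Least-squares solution of the windowed system A v = b.

    Ix, Iy, It : 1-D arrays of the spatial and temporal image derivatives
                 sampled at the pixels q_1, ..., q_n inside the window.
    Returns the local flow vector v = (Vx, Vy).
    """
    A = np.stack([Ix, Iy], axis=1)   # n x 2 matrix of spatial gradients
    b = -It                          # right-hand side
    v, *_ = np.linalg.lstsq(A, b, rcond=None)   # v = (A^T A)^{-1} A^T b
    return v
```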

The optical flow vector field is split into horizontal and vertical components $F_x$ and $F_y$, each of which is further half-wave rectified into four nonnegative channels $F_x^+$, $F_x^-$, $F_y^+$, $F_y^-$, so that $F_x = F_x^+ - F_x^-$ and $F_y = F_y^+ - F_y^-$.

The motion descriptors of two different frames are compared using a version of the normalized correlation.

Suppose the four channels for frame $i$ of sequence $A$ are $a^i_1, a^i_2, a^i_3, a^i_4$ and the four channels for frame $j$ of sequence $B$ are $b^j_1, b^j_2, b^j_3, b^j_4$; then the similarity between frames $i$ and $j$ is $S(i, j) = \sum_{t \in T} \sum_{(x, y) \in I} \sum_{c=1}^{4} a^{i+t}_c(x, y)\, b^{j+t}_c(x, y)$, where $T$ and $I$ are the temporal and spatial extents of the motion descriptors. In this paper, we use fixed values of $T$ and $I$. Therefore, we can obtain the key frames by clustering the frames. Figure 2 depicts the key frames extracted using optical flow.
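The following Python sketch illustrates this pipeline under simplifying assumptions: the temporal pooling over the extent $T$ is omitted, the similarity is a plain normalized correlation of the stacked channels, and a greedy farthest-point selection stands in for the frame clustering; all names are illustrative.

```python
import numpy as np

def four_channel_descriptor(Fx, Fy):
    """Half-wave rectify a flow field into the four nonnegative channels."""
    return np.stack([np.maximum(Fx, 0), np.maximum(-Fx, 0),
                     np.maximum(Fy, 0), np.maximum(-Fy, 0)])

def frame_similarity(desc_a, desc_b):
    """Normalized correlation between two 4-channel motion descriptors."""
    a, b = desc_a.ravel(), desc_b.ravel()
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def key_frames(descriptors, k):
    """Greedy farthest-point selection of k key frames (a simple stand-in
    for clustering the frames by their pairwise similarity)."""
    chosen = [0]
    while len(chosen) < k:
        scores = [max(frame_similarity(d, descriptors[c]) for c in chosen)
                  for d in descriptors]
        chosen.append(int(np.argmin(scores)))   # least similar to chosen set
    return sorted(set(chosen))
```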

Then, we extract the feature from each key frame. We assume that the overall number of key frames is $K$ and that the lengths of the silhouette in the horizontal and vertical directions are $w$ and $h$, respectively; their ratio is defined as $r = w / h$.

We assume that this area attains its maximum value along one of the search directions.

The length of the projection of the area along a given direction is defined as the area width. It is computed from the pixel distance [31] on the binary image of the human body area, by counting, along the direction perpendicular to a group of parallel scan lines, the number of lines that intersect the region and multiplying by the distance between the scan lines. In this paper, we use the eight search directions shown in Figure 3, each with its corresponding distance between search lines.
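As a simplified stand-in for the direction-wise projection widths (using only the tight bounding box rather than the eight search directions of Figure 3), the width-to-height ratio of a key-frame silhouette can be computed as follows; the function name is illustrative.

```python
import numpy as np

def silhouette_ratio(mask):
    """Width-to-height ratio r = w / h of a binary silhouette mask.

    Uses the tight bounding box as a simplified stand-in for the
    direction-wise projection widths described in the text.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return 0.0
    h = ys.max() - ys.min() + 1   # vertical extent of the silhouette
    w = xs.max() - xs.min() + 1   # horizontal extent of the silhouette
    return w / h
```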

In order to extract the silhouette of the human body area, the following steps are performed (a minimal code sketch follows this list):
(1) a Gaussian filter is used to eliminate noise in the video sequence;
(2) a gradient filter is used to find the edge strength, with one kernel estimating the gradient in the x-direction and the other estimating the gradient in the y-direction;
(3) the direction of the edge is computed from the gradients in the x and y directions;
(4) once the edge directions are known, nonmaximum suppression is applied: it traces along the edge in the edge direction and suppresses (sets to 0) any pixel value that is not considered to be an edge, giving a thin line in the output;
(5) hysteresis thresholding is applied to the frames: any pixel whose edge strength is greater than the first (upper) threshold is immediately marked as an edge, and any pixel connected to such an edge pixel whose strength is greater than the second (lower) threshold is also marked as an edge. Figure 4 shows the silhouette of the human body area.
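A minimal code sketch of these five steps, using OpenCV's Canny detector (which internally performs the gradient, nonmaximum-suppression, and hysteresis steps), is shown below; the threshold values and kernel size are illustrative.

```python
import cv2

def body_silhouette_edges(frame, low_thresh=50, high_thresh=150):
    """Edge map of the human body area following steps (1)-(5) above.

    The threshold values and kernel size are illustrative; cv2.Canny
    internally performs the gradient, nonmaximum-suppression, and
    hysteresis steps (2)-(5).
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), sigmaX=1.4)   # step (1): denoise
    edges = cv2.Canny(blurred, low_thresh, high_thresh)    # steps (2)-(5)
    return edges
```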

Therefore, we obtain the body contours from this step and use the optical flow descriptor together with the shape descriptor (the ratio of width to height) to represent the video frames. To construct the codebook, we randomly select a subset of all the frames and compute the affinity matrix on this subset, where each entry of the affinity matrix is the similarity between frame $i$ and frame $j$ calculated using the normalized correlation described above. Then, we run $k$-medoid clustering on this affinity matrix to obtain $k$ clusters. Code-words are then defined as the centers of the obtained clusters.
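The codebook step can be sketched as follows: a simple k-medoid procedure over the precomputed affinity matrix, with the medoid frames serving as code-words. The initialization and iteration scheme are illustrative, not the exact clustering routine used in our experiments.

```python
import numpy as np

def k_medoids_from_affinity(affinity, k, n_iters=50, seed=0):
    """k-medoid clustering on a precomputed frame-affinity matrix.

    affinity : (n, n) symmetric similarity matrix (e.g., the normalized
               correlation between frame descriptors).
    Returns the indices of the k medoid frames, which serve as code-words.
    """
    rng = np.random.default_rng(seed)
    n = affinity.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iters):
        labels = np.argmax(affinity[:, medoids], axis=1)  # assign to medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size == 0:
                continue
            # The medoid maximizes the total similarity to its members.
            within = affinity[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmax(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids
```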

4.2. Construct the Codebook

In the end, each video sequence is converted to the “bag of words” representation by replacing each frame by its corresponding codeword and removing the temporal information.

We follow the convention from statistical text document analysis, and each image is represented as a bag of code-words. Given a training set of images with annotation words, we use the following notation. Each image is a collection of $N$ visual feature code-words, denoted as $\mathbf{w} = (w_1, \dots, w_N)$, where each $w_n$ is a unit-basis vector of size $V_w$ with exactly one nonzero entry representing the index of the current visual feature in the visual feature dictionary of size $V_w$. Similarly, for an image annotated with $M$ words, we denote each word as a unit-basis vector $y_m$ of size $V_y$, again with only one entry equal to 1 and all others 0; here $V_y$ is the word dictionary size. Therefore, a collection of $D$ training image-word pairs can be denoted as $\{(\mathbf{w}_d, \mathbf{y}_d)\}_{d=1}^{D}$.
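The conversion from a sequence of frame descriptors to the bag-of-words representation can be sketched as below, where `similarity` is the normalized correlation defined in Section 4.1 and all names are illustrative.

```python
import numpy as np

def to_bag_of_words(frame_descriptors, codeword_descriptors, similarity):
    """Replace each frame by its most similar code-word and drop the order.

    Returns the bag-of-words histogram over the codebook together with the
    per-frame one-hot (unit-basis) assignments.
    """
    k = len(codeword_descriptors)
    one_hots = []
    for desc in frame_descriptors:
        idx = int(np.argmax([similarity(desc, cw)
                             for cw in codeword_descriptors]))
        onehot = np.zeros(k)
        onehot[idx] = 1.0
        one_hots.append(onehot)
    histogram = np.sum(one_hots, axis=0)
    return histogram, one_hots
```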

5. Action Classification

In order to capture the correlation of topics, we model the hyperparameter of the topic prior distribution as a multivariate normal distribution instead of a Dirichlet, similar to [29, 32], and encode the dependencies among topics through its covariance matrix. We then use the logistic normal function $\theta_k = \exp(\eta_k) / \sum_{j=1}^{K} \exp(\eta_j)$ to project the multivariate normal variable $\eta$ to topic proportions $\theta$ for each image; here $K$ is the number of topics.

Let $\mu$ be a $K$-dimensional mean and $\Sigma$ a $K \times K$ covariance matrix of the normal distribution, let the topics $\beta_{1:K}$ be multinomials over a fixed visual vocabulary of size $V_w$, and let $\pi_{1:K}$ be multinomials over a fixed text word vocabulary of size $V_y$. The CTM generates an image-word pair with $N$ image code-words and $M$ annotation words [29, 32] from the following generative process (a sampling sketch is given after this list):
(1) draw topic proportions $\eta \sim \mathcal{N}(\mu, \Sigma)$ and map them through the logistic normal function to obtain $\theta$;
(2) for each visual feature $w_n$, $n = 1, \dots, N$: (a) draw a topic assignment $z_n \sim \mathrm{Mult}(\theta)$; (b) draw the visual feature $w_n \sim \mathrm{Mult}(\beta_{z_n})$;
(3) for each textual word $y_m$, $m = 1, \dots, M$: (a) draw a feature index $s_m \sim \mathrm{Unif}(1, \dots, N)$; (b) draw the textual word $y_m \sim \mathrm{Mult}(\pi_{z_{s_m}})$.
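The generative process can be written compactly as the following sampling sketch, where `beta` and `pi` hold the topic-feature and topic-word multinomials; it mirrors the steps above rather than any particular implementation, and all names are illustrative.

```python
import numpy as np

def sample_image_word_pair(mu, Sigma, beta, pi, N, M, seed=0):
    """Sample one (code-words, annotation-words) pair from the process above.

    beta : (K, Vw) topic-feature multinomials; pi : (K, Vy) topic-word
    multinomials.  All names are illustrative.
    """
    rng = np.random.default_rng(seed)
    eta = rng.multivariate_normal(mu, Sigma)        # correlated topic draw
    theta = np.exp(eta) / np.exp(eta).sum()         # logistic normal mapping
    z = rng.choice(len(theta), size=N, p=theta)     # topic per visual feature
    w = np.array([rng.choice(beta.shape[1], p=beta[zn]) for zn in z])
    s = rng.integers(0, N, size=M)                  # feature index per word
    y = np.array([rng.choice(pi.shape[1], p=pi[z[sm]]) for sm in s])
    return w, y
```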

First, we generate the visual features from the correlated topic proportions $\theta$, conditional on the topic-feature multinomials $\beta$; then, for each of the text words, one of the features is selected and correspondingly assigned to a text word, conditional on the topic-word multinomials $\pi$. This model is shown as a directed graphical model in Figure 5.

From the generative process of CTM, it can be seen that topic correlations are modeled and generated through the covariance matrix of the prior multivariate normal distribution. To learn the parameters of the CTM that maximize the likelihood of the training data, we iteratively estimate the variational parameters of the latent variables and the model parameters. In this paper, the first step introduces and computes a set of variational parameters to obtain an approximate lower bound on the likelihood of each sample. The second step estimates the model parameters that maximize the log likelihood of the whole set of training samples. In the graphical representation of the CTM in Figure 5, the topic proportion variable is conditionally dependent on the parameters of the normal prior, which makes computing the log likelihood intractable. The mean-field variational distribution therefore factorizes over the latent variables, with a variational mean and variance for the normal distribution, a variational multinomial over topics, and a variational multinomial over code-words.

Let $\Theta$ denote the model parameters; we bound the log likelihood of an image-annotation pair from below using the expectation taken according to the variational distribution. Taking this lower bound as the objective function, we fit the variational parameters by coordinate ascent to maximize the objective function.

Then, we update the variational multinomials. The terms of the objective involving a given variational multinomial are collected and maximized subject to the multinomial (sum-to-one) constraint, which introduces a Lagrange constant.

The full variational inference procedure repeats the updates of (24) and (25) until the objective function in (23) converges.

Then, we turn to parameter estimation. Given a collection of human action image data with annotation words, we find the maximum likelihood estimate of the model parameters.

The overall log likelihood of the collection is bounded from below by the sum of the per-sample lower bounds.

We maximize the lower bound by plugging (23) into (17) and then update the model parameters by setting the derivative with respect to each model parameter equal to zero.

For each model parameter in turn, we collect the terms of the lower bound that contain it and set the corresponding derivative to zero, which yields a closed-form update for that parameter.

Once we have obtained the variational parameters for all the training samples, we update the model parameters by plugging them into (30), (34), and (36), and we iterate until the overall likelihood in (28) converges. We then approximate the conditional distribution of the annotation words given the code-words. This conditional probability can be treated as the predicted confidence score of each annotation word in the word vocabulary, given the whole set of code-words of the unknown human behavior.
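A minimal sketch of how such confidence scores could be computed from the fitted quantities is given below; it assumes per-code-word variational topic assignments `phi` and the topic-word multinomials `pi` as defined above, and approximates the conditional probability by a simple average, which is an illustrative simplification rather than the exact expression in our derivation.

```python
import numpy as np

def annotation_scores(phi, pi):
    """Predicted confidence score of every annotation word for one video.

    phi : (N, K) variational topic assignments of the N code-words.
    pi  : (K, Vy) topic-word multinomials.
    Approximates p(word | code-words) by averaging the topic-word
    probabilities under the inferred per-feature topic assignments.
    """
    topic_usage = phi.mean(axis=0)   # (K,) average topic usage over code-words
    return topic_usage @ pi          # (Vy,) scores over the word vocabulary

# The recognized action can be taken as the annotation word with the
# highest score: label = int(np.argmax(annotation_scores(phi, pi)))
```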

6. Experimental Result

6.1. Datasets

We test our algorithm on three datasets: the Weizmann human motion dataset [33], the KTH human action dataset [30], and the UIUC action dataset [34]. All the experiments are conducted on a Pentium 4 machine with 2 GB of RAM, using a MATLAB implementation. The datasets and the related experimental results are presented in the following sections.

The KTH dataset, provided by Schuldt et al. [30], contains 2391 video sequences of 25 actors performing six actions. Each action is performed in 4 different scenarios.

The Weizmann dataset, provided by Blank et al. [33], contains 93 video sequences showing nine different people, each performing ten actions: run, walk, skip, jumping jack, jump forward on two legs, jump in place on two legs, gallop sideways, wave two hands, wave one hand, and bend.

The UIUC action dataset was created by the University of Illinois at Urbana-Champaign (UIUC) in 2008 for human activity recognition. The activities are walking, running, jumping, waving, jumping jacks, clapping, jumping from sit-up, raising one hand, stretching out, turning, sitting to standing, crawling, pushing up, and standing to sitting.

For every dataset, 12 video sequences taken by four of the five subjects are used for training and the remaining three videos for testing. The experiments are repeated five times, and the performance of the different methods is reported as the average recognition rate. In order to evaluate the performance of action recognition, we report the overall accuracy on the three datasets.

6.2. Comparison

KTH Dataset. It contains six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. Representative frames of this dataset are shown in Figure 6(a). After restoring the missing coordinate positions, we apply the proposed method; the classification results on the KTH dataset are shown in Figure 7 and indicate that only a small number of videos are misclassified, with the actions “running” and “handclapping” being the ones most often confused.

The Weizmann Dataset. The Weizmann human action dataset contains 83 video sequences showing nine different people, each performing nine different actions: bending (a1), jumping jack (a2), jumping forward on two legs (a3), jumping in place on two legs (a4), running (a5), galloping sideways (a6), walking (a7), waving one hand (a8), and waving two hands (a9). The figures were tracked and stabilized using the background subtraction masks that come with this dataset. Some sample frames are shown in Figure 6(b). The classification results achieved by this approach are shown in Figure 8.

The UIUC Action Dataset. This dataset consists of 532 high resolution sequences of 14 activities performed by eight actors. The activities are walking, running, jumping, waving, jumping jacks, clapping, jumping from sit-up, raising one hand, stretching out, turning, sitting to standing, crawling, pushing up, and standing to sitting. Some sample frames are shown in Figure 6(c). The classified results achieved by this approach are shown in Figure 9.

In this paper, we focus on jogging, running, walking, and boxing and compare the proposed method with three state-of-the-art methods from the literature, Zhang and Gong [35], Gong et al. [36], and Chang et al. [37], on the three datasets. As shown in Tables 1, 2, and 3, the existing methods achieve low recognition accuracy on these actions, not only because the occlusion situations are complex but also because the legs undergo complex, repetitive motion combined with other body movements. The proposed method can overcome these problems, and both its per-class recognition accuracy and its average accuracy are higher than those of the compared methods.

The experimental results show that the approach proposed in this paper achieves satisfactory results and performs significantly better than [35–37] in terms of average accuracy, owing to the practical method adopted in the paper.

7. Conclusions and Future Work

In this paper, we proposed an adaptive occlusion action recognition method for human body movement, based on optical flow and the correlated topic model (CTM). Our method successfully recognizes actions without assuming a known and fixed depth order. We presented the MRF and SFM models, which estimate the adaptive occlusion state and recover the important missing parts of objects in a video clip. We then employed the optical flow motion feature to extract the key frames and calculated the ratio of width to height from the human silhouette. Finally, we used the correlated topic model (CTM) to classify and recognize actions. Experiments were performed on the KTH, Weizmann, and UIUC action datasets to test and evaluate the proposed method. The comparative experimental results showed that the proposed method is more effective than the compared approaches [35–37].

Future work will add complex event detection to the proposed system, addressing more difficult problems such as more variable motion, inter-person occlusions, and possible appearance similarity between different people, as well as increasing the database size.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper (such as financial gain).

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 51278068) and by the Science and Technology Project of Hunan (Grant no. 2013GK3012).