Abstract

Human action recognition is an important area of computer vision research. Focusing on the problem of self-occlusion in human action recognition, a new adaptive occlusion state behavior recognition approach is presented based on the Markov random field and probabilistic Latent Semantic Analysis (pLSA). First, a Markov random field is used to represent the occlusion relationships between human body parts in terms of an occlusion state variable obtained from the phase space. Then, a hierarchical area variety model is proposed. Finally, the topic model of pLSA is used to recognize human behavior. Experiments were performed on the KTH, Weizmann, and HumanEva datasets to test and evaluate the proposed method. The comparative experimental results show that the proposed method is more effective than the compared methods.

1. Introduction

Automatic recognition of human actions from video is a challenging problem that has attracted the attention of researchers in recent decades. It has applications in many areas such as entertainment, virtual reality, motion capture, sport training [1], medical biomechanical analysis, ergonomic analysis, human-computer interaction, surveillance and security, environmental control and monitoring, and patient monitoring systems.

Occlusion state recognition has traditionally been tackled by applying statistical prediction and inference methods. Unfortunately, basic numerical methods have proved insufficient when dealing with complex occlusion scenarios that present interactions between objects (e.g., occlusions, unions, or separations), modifications of the objects (e.g., deformations), and changes in the scene (e.g., illumination). These events are hard to manage and frequently result in tracking errors such as track discontinuity and inconsistent track labeling.

The pictorial structure method [2], which represents the human body as a set of linked rectangular regions, does not take occlusion into account. Sigal et al. [3] argue that the self-occlusion problem can be reduced by an occlusion-sensitive likelihood model. This works well if the occlusion state (i.e., the depth ordering of parts) is known, for example, if it is specified at the start of the motion and then does not change over time. In practice, however, the depth order of body parts (for example, the right arm relative to the torso) is unknown and may change during the motion. Estimating 2D human pose is difficult because of image noise (e.g., illumination and background clutter), self-occlusion, and the variety of human appearances (i.e., clothing, gender, and body shape) [3–5]. Estimating and tracking 3D human pose is even more challenging because of the large state space of the human body in 3D and our indirect knowledge of 3D depth [6]. In contrast, our approach focuses on self-occlusion. While all of the above methods are designed to estimate poses from still images, there exists only limited research on the same task in videos. Guo et al. [7] applied the BOW model to human action recognition in video sequences. Niebles et al. [8] successfully applied this model to classify video sequences of human actions. Wang and Mori [9] assigned each frame of an image sequence to a visual word by analyzing the motion of the person it contains. Sy et al. [10] applied a CRF with a hidden state structure to predict the label of a whole sequence of human gestures. Sigal et al. [3] modeled self-occlusion handling in the PS framework as a set of constraints on the occluded parts, which are extracted after performing background subtraction; this renders the method unsuitable for scenes with dynamic backgrounds.

Our work follows [3, 7, 9, 11] by producing a framework for articulated pose estimation that is robust to cluttered backgrounds and self-occlusion without relying on background subtraction models. The step of rectifying occluded body parts via a GPR model is inspired by recent work by Asthana et al. [12], who used GPR to model parametric correspondences between face models of different people. Our problem is more difficult because the human body involves more parameters to be rectified and has more degrees of freedom than the face.

In order to overcome the shortcomings mentioned above, we propose an adaptive self-occlusion state recognition method that estimates not only the body configuration but also the occlusion states of the body parts.

First, the Markov random field is used to represent the occlusion relationship between human body parts in terms of an occlusion state variable obtained from the phase space. Then, we propose a hierarchical area variety model. Finally, we infer human behavior by pLSA. Experiments were performed on the KTH, Weizmann, and HumanEva datasets to test and evaluate the proposed algorithm. The experimental results show that the proposed method is effective for action recognition.

2. Human Trajectory Reconstruction

A tree-structured skeleton model of the human body is used to create a view-invariant model [13]. The human body is divided into 15 key points; namely, 15 joint points represent the human body structure, and the trajectories of the 15 joints represent the human behavior. A Markov random field (MRF) is then used, by computing the observation, the spatial relations, and the motion relationships, to determine the occluded positions of the body joints and to restore the missing trajectories. The specific steps are described below.

The Markov random field (MRF) was used with a state variable representing the occlusion relationship between body parts. Formally, the MRF is a graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The graph nodes represent the states of the human body parts, and the graph edges model the relationships between the parts [11]. The probability distribution over this graph is specified by the set of potentials defined over the set of edges. The MRF structural parameters are defined as follows: $p_i$: the coordinates of the $i$th joint point, where $i = 1, \ldots, 15$ indexes the 15 extracted key points of the body; $v_i$: the visibility of the $i$th joint, used to determine the occlusion relation between nodes (when occlusion occurs, the trajectories of the involved joints intersect); $o_{ij}$: the occlusion relation among the 15 body joints, where $o_{ij} = 0$ means the $i$th and $j$th joints do not occlude each other, $o_{ij} = 1$ means the $i$th joint is occluded by the $j$th joint, and $o_{ij} = -1$ means the $i$th joint occludes the $j$th joint; and the node of the joint occluded by the $i$th joint. The potential of the kinematic relationship is then calculated as follows:

This function encodes the positions of two adjacent joints and the angles between them.

Here, $d_{ij}$ is the Euclidean distance between two adjacent joints, and $\mathcal{N}(0, \sigma)$ is the normal distribution with mean $0$ and standard deviation $\sigma$.

The remaining parameters are defined as follows: $A_i$: the occlusion area belonging to joint $i$; $h_i$: the occlusion indicator, with $h_i = 1$ if joint $i$ is occluded and $h_i = 0$ otherwise; $I$: the input image; $\lambda$: the indicator for overlapping body parts; $\phi_{obs}$: the potential of observation; $\phi_c$: the potential of the color; $\phi_e$: the potential of the edge; $m_i^{v}$: the motion state of the $i$th body joint in the visible area; $m_i^{o}$: the motion state of the $i$th body joint in the occluded area; $\psi_{kin}$: the potential of the kinematic relationship; $\psi_{tmp}$: the potential of the temporal relationship. A model similar to [12] is defined for calculating the three potential functions as follows.

First, we obtain the observation potential function $\phi_{obs}$, which combines the color potential $\phi_c$ and the edge potential $\phi_e$.

The potential of the color consists of two terms: the first term is the probability of occurrence of the color in the visible area, and the second term is for the occluded area. In the visible term, the color of each pixel is evaluated against the color distributions of the pixel given the foreground and the background, respectively. For the occluded term, the occlusion area $A_i$ is determined by the calculated overlapping region of the two body parts involved, and the total occluded region is the sum over all occluded nodes.

When $h_i = 1$, the position of joint $i$ is constrained to lie between the lower and upper bounds of the motion area between the adjacent joints, as defined by kinesiology.

Finally, the potential of the temporal relationship is calculated from the dynamics of the joint at the previous time step, weighted by a diagonal matrix, so that the potential behaves like a Gaussian distribution over time.

In this paper, the posterior distribution of the model, conditioned on all input images up to the current time step, is defined over the current joint structure and the occlusion state variable and is normalized by a constant.

In a word, we substitute the observation, kinematic, and temporal potentials into (4) and obtain the positions of the occluded body joints, where each position is the joint location at time $t$.

The occlusion relation among the joints can be obtained by formula (2), where the input is the joint position at each time step.
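For concreteness, the occlusion state variable $o_{ij}$ introduced above can be stored as a small matrix. The following Python sketch is only illustrative: it assumes each joint comes with a rectangular 2D region and a depth estimate (hypothetical inputs not specified in this paper) and assigns $o_{ij} \in \{-1, 0, 1\}$ from region overlap and depth ordering.

```python
import numpy as np

def occlusion_relation(boxes, depths):
    """Minimal sketch of the pairwise occlusion state variable o[i, j].

    boxes  : (N, 4) array of 2D joint regions as (x_min, y_min, x_max, y_max).
    depths : (N,)   array of estimated depths (smaller = closer to the camera).
    Returns an (N, N) matrix with o[i, j] = 0 (no occlusion),
    1 (joint i is occluded by joint j), or -1 (joint i occludes joint j).
    """
    n = len(boxes)
    o = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            # Overlap of the two rectangular regions.
            w = min(boxes[i][2], boxes[j][2]) - max(boxes[i][0], boxes[j][0])
            h = min(boxes[i][3], boxes[j][3]) - max(boxes[i][1], boxes[j][1])
            if w <= 0 or h <= 0:
                continue                      # disjoint regions: o stays 0
            if depths[i] > depths[j]:         # joint j lies in front of joint i
                o[i, j], o[j, i] = 1, -1
            else:                             # joint i lies in front of joint j
                o[i, j], o[j, i] = -1, 1
    return o
```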

The occluded joints can be computed by the MRF over the entire duration of the motion. In this paper, we connect the missing data in order to restore the missing coordinate positions.
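The following is a minimal sketch of how missing coordinate positions could be restored once the occluded frames are known; the use of linear interpolation between visible neighbors is an assumption made for the example, since the paper only states that the missing data are connected.

```python
import numpy as np

def restore_trajectory(positions, occluded):
    """Fill in occluded samples of one joint trajectory.

    positions : (T, 2) array of joint coordinates; occluded entries may be arbitrary.
    occluded  : (T,) boolean array, True where the joint was marked occluded by the MRF.
    Returns a copy with occluded samples linearly interpolated from visible neighbors.
    """
    restored = positions.astype(float).copy()
    t = np.arange(len(positions))
    visible = ~occluded
    for dim in range(positions.shape[1]):
        restored[occluded, dim] = np.interp(t[occluded], t[visible],
                                            positions[visible, dim])
    return restored
```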

3. Feature Representation

Human actions can be recognized in terms of the hierarchical area model, the relative velocity, and the relative acceleration.

3.1. Hierarchical Area Model

To describe the human motion pose (e.g., jogging, running, and walking), we make use of a hierarchical area model and extract the human facial area $S_{face}$, the upper-limb area $S_{limb}$, and the leg area $S_{leg}$. The human facial area is extracted in the following way.
(1) Using the Canny algorithm, the set of facial contour points in each frame is extracted and denoted as $C$, whose size is the number of contour points.
(2) The face contour is fitted by least squares to the point set $C$ obtained in Step 1.
(3) According to Steps 1 and 2, if the body faces the camera frontally, the face area is the largest; if the person turns sideways, the face area changes. Thus, the face area in the image coordinate system is computed for each frame from the set of face contour points of that frame.
(4) By repeating Steps 1-3, the face area can be calculated for all frames.

Calculating $S_{limb}$ and $S_{leg}$ is similar to calculating $S_{face}$.
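As an illustration of Steps 1-4, the sketch below uses OpenCV's Canny edge detector and a least-squares ellipse fit to approximate the face area per frame. The face region of interest, the Canny thresholds, and the choice of an ellipse fit are assumptions made for this example, not details given in the paper.

```python
import cv2
import numpy as np

def face_area_per_frame(frames, face_roi):
    """Approximate the facial area S_face for every frame.

    frames   : list of grayscale images (numpy arrays).
    face_roi : (x, y, w, h) region assumed to contain the face (hypothetical input).
    Returns a list with one area value per frame.
    """
    x, y, w, h = face_roi
    areas = []
    for frame in frames:
        roi = frame[y:y + h, x:x + w]
        # Step 1: extract contour points with the Canny detector (thresholds are illustrative).
        edges = cv2.Canny(roi, 50, 150)
        points = np.column_stack(np.nonzero(edges)[::-1]).astype(np.float32)  # (x, y) pairs
        if len(points) < 5:
            areas.append(0.0)
            continue
        # Step 2: least-squares ellipse fit to the contour point set.
        (cx, cy), (major, minor), angle = cv2.fitEllipse(points)
        # Step 3: the area of the fitted ellipse serves as the face area for this frame.
        areas.append(np.pi * (major / 2.0) * (minor / 2.0))
    return areas
```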

Figure 1 shows the curves of some area features of pedestrian walking. Figure 1(a) is the area variation curve of the facial area, Figure 1(b) is the area variation curve of the upper-limb area, and Figure 1(c) is the area variation curve of the leg area.

3.2. Relative Velocity and Relative Acceleration

We can obtain the relative velocity and relative acceleration from the trajectory of each joint.

The weight of each point is considered to be the same, and a statistical model is built to calculate the relative velocity and relative acceleration among joints in relative motion (e.g., hands and legs) in order to infer the initial state of the motion.

Here, $\Delta v(i, j)$ denotes the relative velocity between joints $i$ and $j$.
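A minimal sketch of how the relative velocity and relative acceleration between two joints could be computed from their restored trajectories by finite differences; the frame rate and the use of simple differencing are assumptions made for illustration.

```python
import numpy as np

def relative_velocity_acceleration(traj_i, traj_j, fps=25.0):
    """Relative velocity and acceleration between two joint trajectories.

    traj_i, traj_j : (T, 2) arrays of joint coordinates over T frames.
    fps            : frame rate used to convert frame differences to time (assumed).
    Returns (delta_v, delta_a), each of length T-2, computed by finite differences.
    """
    dt = 1.0 / fps
    v_i = np.diff(traj_i, axis=0) / dt           # per-joint velocities, (T-1, 2)
    v_j = np.diff(traj_j, axis=0) / dt
    delta_v = np.linalg.norm(v_i - v_j, axis=1)  # relative speed per frame pair
    a_i = np.diff(v_i, axis=0) / dt              # per-joint accelerations, (T-2, 2)
    a_j = np.diff(v_j, axis=0) / dt
    delta_a = np.linalg.norm(a_i - a_j, axis=1)
    return delta_v[1:], delta_a                  # align both outputs to length T-2
```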

The area-velocity goodness is obtained as follows.
T1 (jogging): $\Delta v$(left knee, right knee), $\Delta v$(left foot, right foot), $\Delta v$(right knee, right foot), $\Delta v$(left foot, left ankle), $\Delta v$(right foot, right ankle) $> \theta_1$, and $\Delta v$(left foot, left knee) $> \theta_2$.
T2 (running): $\Delta v$(left foot, left knee), $\Delta v$(right foot, right knee), $\Delta v$(left foot, left ankle), $\Delta v$(right foot, right ankle) $> \theta_3$, and $\Delta v$(left foot, left knee), $\Delta v$(left foot, right knee), $\Delta v$(left foot, right foot) $> \theta_4$.
T3 (walking): $\Delta v$(left foot, left knee), $\Delta v$(right foot, right knee), $\Delta v$(left foot, left ankle), and $\Delta v$(right foot, right ankle) $> \theta_5$.
T4 (jumping): $\Delta v$(left foot, left knee), $\Delta v$(right foot, right knee), $\Delta v$(left foot, left ankle), $\Delta v$(right foot, right ankle) $> \theta_6$, and $\Delta v$(left foot, left ankle), $\Delta v$(right foot, right ankle) $> \theta_7$.
T5 (boxing): $\Delta v$(left foot, left knee), $\Delta v$(right foot, right knee), $\Delta v$(left foot, left ankle), $\Delta v$(right foot, right ankle) $> \theta_8$, and $\Delta v$(left hand, left elbow), $\Delta v$(right hand, right elbow), $\Delta v$(left foot, left ankle), $\Delta v$(right foot, right ankle) $> \theta_9$.

The thresholds $\theta_1$ through $\theta_9$ are determined empirically as 1.5, 40, 5.5, 60, 3.5, 5.0, 40, 7.0, and 30, respectively.
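The rules T1-T5 can be read as simple threshold tests on the relative velocities of specific joint pairs. The sketch below checks the walking rule T3 under that reading; the joint naming scheme and the way a per-pair statistic is summarized (here, the mean relative speed over the sequence) are assumptions made for the example.

```python
def is_walking(delta_v, theta_5=3.5):
    """Check rule T3 (walking): the listed joint-pair relative speeds all exceed theta_5.

    delta_v : dict mapping a joint pair, e.g. ("left foot", "left knee"),
              to its mean relative speed over the sequence (hypothetical format).
    """
    pairs = [("left foot", "left knee"), ("right foot", "right knee"),
             ("left foot", "left ankle"), ("right foot", "right ankle")]
    return all(delta_v[pair] > theta_5 for pair in pairs)
```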

We cluster the extracted features that meet the threshold requirements and extract the typical behaviors of the action dataset as standard actions: jogging, running, walking, jumping, and boxing. By decomposing the above five kinds of common actions, we obtain the relative velocities among joints when each action occurs. For example, during a jogging action, the relative velocity between the left leg and the right leg and the relative velocity between the left leg and the left knee are larger than those between other joints.

3.3. Codebook Formulation

In order to construct the codebook, we use the $k$-means algorithm based on the Euclidean distance to cluster all the features (hierarchical area model, relative velocity, and relative acceleration) extracted from the training frames. The center of each cluster is defined as a codeword. All the centers clustered from the training frames produce the codebook for the pLSA model. A frame in the training or test videos is assigned to the codeword in the codebook that has the minimal Euclidean distance to the frame. In the end, a video is encoded in a bag-of-words manner; that is, a video is represented by a histogram of codewords, discarding the temporal information.
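A minimal sketch of the codebook construction and bag-of-words encoding described above, using scikit-learn's KMeans; the feature dimensionality and the codebook size are placeholders, not values given in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_features, codebook_size=200):
    """Cluster per-frame features (area model, relative velocity/acceleration)
    with k-means; the cluster centers form the codebook."""
    kmeans = KMeans(n_clusters=codebook_size, n_init=10, random_state=0)
    kmeans.fit(train_features)               # train_features: (num_frames, feature_dim)
    return kmeans

def encode_video(kmeans, video_features):
    """Represent one video as a histogram of codewords (temporal order discarded)."""
    words = kmeans.predict(video_features)   # nearest codeword per frame
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)       # normalized bag-of-words histogram
```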

4. pLSA-Based Human Action Recognition

pLSA is a statistical generative model that associates documents and words via latent topic variables and represents each document as a mixture of topics. Our approach uses the bag-of-words representation as in [14–16]. The difference is that we use the local spatial-temporal maximum values of the hierarchical area model, the relative velocity, and the relative acceleration as our features. We assume that the words are independent of the temporal order but related to the spatial order, because $k$-means clustering of all the features may lead to mismatches between words: similar local features appearing at different positions may be clustered together, so that when we calculate the frequency of the words, mismatches appear. This phenomenon may reduce the precision of the classification approach. In order to solve the problem, we assign spatial information to each word. In the classification step, we use the pLSA model to learn and recognize human actions.

In the context of action categorization, the topic variables $z_k$ correspond to action categories, and each video $d_j$ can be treated as a collection of space-time words $w_i$. The joint probability of video $d_j$, action category $z_k$, and space-time word $w_i$ can be expressed as
$$P(d_j, z_k, w_i) = P(d_j)\, P(z_k \mid d_j)\, P(w_i \mid z_k),$$
where $P(w_i \mid z_k)$ is the probability of word $w_i$ occurring in action category $z_k$, $P(z_k \mid d_j)$ is the probability of topic $z_k$ occurring in video $d_j$, and $P(d_j)$ can be considered the prior probability of $d_j$. The conditional probability of $w_i$ given $d_j$ can be obtained by marginalizing over all the topic variables $z_k$:
$$P(w_i \mid d_j) = \sum_{k} P(w_i \mid z_k)\, P(z_k \mid d_j).$$
Denoting by $n(w_i, d_j)$ the number of occurrences of word $w_i$ in video $d_j$, the prior probability can be modeled as
$$P(d_j) \propto \sum_{i} n(w_i, d_j).$$
A maximum likelihood estimate of $P(w_i \mid z_k)$ and $P(z_k \mid d_j)$ is obtained by maximizing the likelihood function using the Expectation Maximization (EM) algorithm, whose graphical model is shown in Figure 2. The objective likelihood function of the EM algorithm is
$$L = \sum_{j} \sum_{i} n(w_i, d_j) \log P(w_i, d_j).$$
The EM algorithm consists of two steps: an expectation (E) step computes the posterior probability of the latent variables, and a maximization (M) step maximizes the complete-data likelihood computed with the posterior probabilities obtained in the E-step. Both steps of the EM algorithm for pLSA parameter estimation are listed below.

E-step: given $P(w_i \mid z_k)$ and $P(z_k \mid d_j)$, estimate the posterior
$$P(z_k \mid d_j, w_i) = \frac{P(w_i \mid z_k)\, P(z_k \mid d_j)}{\sum_{l} P(w_i \mid z_l)\, P(z_l \mid d_j)}.$$

M-step: given the posterior $P(z_k \mid d_j, w_i)$ estimated in the E-step and the word counts $n(w_i, d_j)$, estimate
$$P(w_i \mid z_k) = \frac{\sum_{j} n(w_i, d_j)\, P(z_k \mid d_j, w_i)}{\sum_{m} \sum_{j} n(w_m, d_j)\, P(z_k \mid d_j, w_m)}, \qquad
P(z_k \mid d_j) = \frac{\sum_{i} n(w_i, d_j)\, P(z_k \mid d_j, w_i)}{\sum_{i} n(w_i, d_j)}.$$

For the task of human motion classification, our goal is to classify a new video into a specific activity class. During the inference stage, given a testing video $d_{\text{test}}$, the document-specific coefficients $P(z_k \mid d_{\text{test}})$ are estimated by running the EM iterations with the learned $P(w_i \mid z_k)$ kept fixed.

We can treat each aspect in the pLSA model as one class of activity, so the activity categorization is determined by the aspect corresponding to the highest $P(z_k \mid d_{\text{test}})$. The action category of $d_{\text{test}}$ is determined as
$$\text{Action}^{*} = \arg\max_{k} P(z_k \mid d_{\text{test}}).$$
In this paper, we treat each frame in a video as a single word and a video as a document. The probability distribution $P(z_k \mid d_{\text{test}})$ can be regarded as the probability of each class label for the new video. The parameter $P(w_i \mid z_k)$ learned in the training step defines the probability of drawing word $w_i$ from aspect $z_k$. The aforementioned standard EM training procedure for pLSA replaces $P(w_i \mid z_k)$ and $P(z_k \mid d_j)$ with their optimal values at each iteration.
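For reference, the following is a compact sketch of the standard pLSA EM updates and the argmax classification described above; the random initialization, the fixed iteration counts, and the folding-in procedure for test videos are implementation assumptions rather than details taken from the paper.

```python
import numpy as np

def plsa_em(counts, num_topics, iters=50, seed=0):
    """Standard pLSA EM.

    counts : (D, W) matrix with counts[j, i] = n(w_i, d_j) for D videos and W codewords.
    Returns p_w_z (W, K) = P(w|z) and p_z_d (D, K) = P(z|d).
    """
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    p_w_z = rng.random((W, num_topics)); p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((D, num_topics)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: posterior P(z | d, w) for every (d, w) pair, shape (D, W, K).
        post = p_z_d[:, None, :] * p_w_z[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z) and P(z|d) from the expected counts.
        weighted = counts[:, :, None] * post           # n(w, d) * P(z | d, w)
        p_w_z = weighted.sum(axis=0)                   # (W, K)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)                   # (D, K)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

def classify(test_counts, p_w_z, iters=30):
    """Fold in one test video (keeping P(w|z) fixed) and return argmax_k P(z_k | d_test)."""
    K = p_w_z.shape[1]
    p_z_d = np.full(K, 1.0 / K)
    for _ in range(iters):
        post = p_z_d[None, :] * p_w_z                  # (W, K), posterior per word
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = (test_counts[:, None] * post).sum(axis=0)
        p_z_d /= p_z_d.sum() + 1e-12
    return int(np.argmax(p_z_d))
```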

For action recognition with a large amount of training data, this would result in a long training time. This paper presents an incremental version of EM to speed up the training of pLSA without sacrificing accuracy. Assuming the observed data are independent of each other, we propose the incremental EM algorithm presented in Algorithm 1.

Algorithm 1 (Incremental EM Algorithm for pLSA Parameter Estimation).
(1) Inputs:
(2) $K$: the number of action categories;
(3) $N$: the number of training videos;
(4) $M$: the number of videos in each subset;
(5) $W$: the size of the codebook of spatial-temporal words;
(6) Outputs:
(7) $P(w_i \mid z_k)$;
(8) $P(z_k \mid d_j)$;
(9) E-step: for all $d_j$ and $w_i$ in the current subset, calculate the posterior $P(z_k \mid d_j, w_i)$ for all pairs; M-step: re-estimate $P(w_i \mid z_k)$ and $P(z_k \mid d_j)$ from the statistics accumulated over the processed subsets;
(10) Repeat the E-step and M-step until the convergence condition is met;
(11) Calculate the activity class $\text{Action}^{*} = \arg\max_{k} P(z_k \mid d_{\text{test}})$.

5. Experimental Results

5.1. Datasets

We test our algorithm on three datasets: the Weizmann human motion dataset [17], the KTH human action dataset [18, 19], and the HumanEva dataset [3, 20]. All experiments are conducted on a Pentium 4 machine with 2 GB of RAM, using a MATLAB implementation. The datasets and the related experimental results are presented in the following sections.

The KTH dataset, provided by Schuldt et al., contains 2391 video sequences of 25 actors performing six actions. Each action is performed in four different scenarios.

The Weizmann dataset, provided by Blank et al., contains 93 video sequences showing nine different people, each performing ten actions: run, walk, skip, jumping-jack, jump-forward-on-two-legs, jump-in-place-on-two-legs, gallop-sideways, wave-two-hands, wave-one-hand, and bend.

The HumanEva dataset [3, 20] is used for evaluation. It contains five different motions: Walking, Jogging, Gestures, Boxing, and Combo.

In order to evaluate and fairly compare performance, we use the same experimental setting as in [21, 22]. For every dataset, 12 video sequences taken by four of the five subjects are used for training, and the remaining three videos are used for testing. The experiments are repeated five times.

The performance of the different methods is reported as the average recognition rate, and we report the overall accuracy on the three datasets. In order to evaluate the performance of occlusion state estimation and of reconstructing missing coordinate positions, we hand-labeled the ground truth of the occlusion states for the test motions. Figure 3 shows how the ground truth of the occlusion states is specified.

5.2. Comparison

KTH Dataset. It contains six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. Representative frames of this dataset are shown in Figure 4(a). After restoring the missing coordinate positions, we apply the proposed method. The classification results on the KTH dataset are shown in Figure 5 and indicate that quite a small number of videos are misclassified; in particular, the actions "running" and "handclapping" tend to be confused.

The Weizmann Dataset. The Weizmann human action dataset contains 83 video sequences showing nine different people, each performing nine different actions: bending (a1), jumping jack (a2), jumping forward on two legs (a3), jumping in place on two legs (a4), running (a5), galloping sideways (a6), walking (a7), waving one hand (a8), and waving two hands (a9).

The figures were tracked and stabilized by using the background subtraction masks that come with this data set. Some sample frames are shown in Figure 4(b). The classified results achieved by this approach are shown in Figure 6.

The HumanEva Dataset. The HumanEva dataset is used for evaluation; sample frames are shown in Figure 4(c). It contains five different motions: Walking (a1), Jogging (a2), Gestures (a3), Boxing (a4), and Combo (a5). Each motion is performed by four subjects and recorded by seven cameras (three RGB and four grayscale cameras) with ground-truth data of the human joints. The classification results achieved by this approach are shown in Figure 7.

In this paper, we identify jogging, running, walking, and boxing and compare the proposed method with state-of-the-art methods in the literature: Blank et al. [18], Lu et al. [19], Sigal et al. [3], Chang et al. [20], and Niebles et al. [21], on the three datasets. As shown in Tables 1, 2, and 3, the existing methods obtain low recognition accuracy on these actions because not only are the occlusion situations complex, but the legs also exhibit complex beats, motions, and other group actions. The proposed method can overcome these problems, and its recognition accuracy and average accuracy are higher than those of the compared methods.

The experimental results show that the approach proposed in this paper achieves satisfactory results and performs significantly better in average accuracy than the methods in [3, 18–21], owing to the practical method adopted in this paper.

6. Conclusions and Future Work

In this paper, we proposed an adaptive occlusion state estimation method for 3D human body movement.

Our method successfully recognizes actions without assuming a known and fixed depth order. The proposed method can infer the state variables efficiently because it separates the estimation procedure into body configuration estimation and occlusion state estimation. More specifically, in the occlusion state estimation step, we first reconstruct the human trajectories representing the occlusion relationships of the 3D human pose and detect body parts that have an occlusion relationship from the overlapping body parts, using a Markov random field (MRF) with a state variable. Finally, we use the topic model of pLSA for classification. Experimental results showed that the proposed method successfully estimates the occlusion states in the presence of self-occlusion, and the average accuracy is about 92.5%, 90.1%, and 91.4% on the KTH, Weizmann, and HumanEva datasets, respectively, which is better than the other approaches [3, 18–21].

We conjecture that the proposed method can be extended to tracking the poses of (two or more) interacting people. Tracking the poses of interacting people, however, will involve more complex problems such as dealing with more variable motion, inter-person occlusions, and possible appearance similarity between different people.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper (such as financial gain).

Acknowledgments

This research work was supported by grants from the Natural Science Foundation of China (no. 50808025) and the Doctoral Fund of China Ministry of Education (Grant no. 20090162110057).