Abstract

Human action recognition based on 3D skeletons has become an active research field in recent years with the development of commodity depth sensors. Most published methods analyze entire 3D depth sequences, construct mid-level part representations, or use trajectory descriptors of spatiotemporal interest points to recognize human activities. Unlike previous work, this paper proposes a novel and simple action representation that models an action as a sequence of inconsecutive and discriminative skeleton poses, named key skeleton poses. The pairwise relative positions of skeleton joints are used as the feature of the skeleton poses, which are mined with the aid of the latent support vector machine (latent SVM). The advantage of our method is its robustness to intraclass variation such as noise and large nonlinear temporal deformation of human actions. We evaluate the proposed approach on three benchmark action datasets captured by Kinect devices: the MSR Action 3D dataset, the UTKinect Action dataset, and the Florence 3D Action dataset. The detailed experimental results demonstrate that the proposed approach achieves performance superior to state-of-the-art skeleton-based action recognition methods.

1. Introduction

The task of automatic human action recognition has been studied over the last few decades as an important area of computer vision research. It has many applications, including video surveillance, human-computer interfaces, sports video analysis, and video retrieval. Despite remarkable research efforts and many encouraging advances in the past decade, accurate recognition of human actions is still quite a challenging task [1].

In traditional RGB videos, human action recognition mainly focuses on analyzing spatiotemporal volumes and their representations. According to the variety of visual spatiotemporal descriptors, human action recognition work can be classified into three categories. The first category is local spatiotemporal descriptors. An action recognition method first detects interest points (e.g., STIPs [2] or trajectories [3]) and then computes descriptors (e.g., HOG/HOF [2] and HOG3D [4]) based on the detected local motion volumes. These local features are then combined (e.g., with bag-of-words) to represent actions. The second category is global spatiotemporal templates that represent the entire action. A variety of image measurements have been proposed to populate such templates, including optical flow and spatiotemporal orientation descriptors [5, 6]. Besides the local and holistic representations, the third category is mid-level part representations, which model moderate portions of the action. Here, parts have been proposed that capture a neighborhood of space-time [7, 8] or a spatial key frame [9]. These representations attempt to balance the tradeoff between the generality exhibited by small patches, for example, visual words, and the specificity exhibited by large ones, for example, holistic templates. In addition, with the advent of inexpensive RGB-depth sensors such as the Microsoft Kinect [10], many efforts have been made to extract features for action recognition from depth data and skeletons. Reference [11] represents each depth frame as a bag of 3D points along the human silhouette and utilizes an HMM to model the temporal dynamics. Reference [12] learns semilocal features automatically from the data with an efficient random sampling approach. Reference [13] selects the most informative joints based on discriminative measures of each joint. Inspired by [14], Seidenari et al. model the movements of the human body using kinematic chains and perform action recognition with a Nearest-Neighbor classifier [15]. In [16], skeleton sequences are represented as trajectories in a high-dimensional space; these trajectories are then interpreted in a Riemannian manifold (shape space), and recognition is finally performed using NN classification on this manifold. Reference [17] extracts a sparse set of active joint coordinates and maps these coordinates to a lower-dimensional linear manifold before training an SVM classifier. The methods above generally extract spatial-temporal representations of skeleton sequences with well-designed handcrafted features. Recently, with the development of deep learning, several Recurrent Neural Network (RNN) models have been proposed for action recognition. In order to recognize actions according to the relative motion between the limbs and the trunk, [18] uses an end-to-end hierarchical RNN for skeleton-based action recognition. Reference [19] uses skeleton sequences to regularize the learning of Long Short Term Memory (LSTM), which is grounded via a deep Convolutional Neural Network (DCNN) onto the video for action recognition.

Most of the above methods rely on entire video sequences (RGB or RGBD) to perform action recognition, in which spatiotemporal volumes are usually selected as the representative feature of an action. These methods suffer from sensitivity to intraclass variation such as temporal scale or partial occlusions. For example, Figure 1 shows that two athletes perform somewhat different poses when diving, which makes the spatiotemporal volumes different. Motivated by this case, the question we seek to answer in this paper is whether a few inconsecutive key skeleton poses are enough to perform action recognition. As far as we know, this is an unresolved issue that has not yet been systematically investigated. In our early work [20], it was shown that some human actions can be recognized with only a few inconsecutive and discriminative frames of RGB video sequences. Related to our work, very short snippets [9] and discriminative action-specific patches [21] have been proposed as representations of specific actions. However, in contrast to our method, these two methods focus on consecutive frames.

In this paper, a novel framework is proposed for action recognition in which key skeleton poses are selected as the representation of an action in RGBD video sequences. In order to make our method more robust to translation, rotation, and scaling, Procrustes analysis [22] is conducted on the 3D skeleton joint data. Then, the pairwise relative positions of the 3D skeleton joints are computed as discriminative features to represent the human movement. Finally, key skeleton poses, defined as the most representative skeleton configurations of the action, are mined from the 3D skeleton videos with the help of the latent support vector machine (latent SVM) [23]. In early exploratory experiments, we observed that at least four inconsecutive key skeleton poses are needed. During testing, the temporal position and similarity of each of the key poses are compared with the model of the action. The proposed approach has been evaluated on three benchmark datasets, the MSR Action 3D dataset [24], the UTKinect Action dataset [25], and the Florence 3D Action dataset [26], all captured with Kinect devices. Experimental results demonstrate that the proposed approach achieves better recognition accuracy than several existing methods. The remainder of this paper is organized as follows. The proposed approach is elaborated in Section 2, including feature extraction, key pose mining, and action recognition. Experimental results are presented and analyzed in Section 3. Finally, we conclude the paper in Section 4.

2. Proposed Approach

Because the same action can be performed in very different ways, the appearance, temporal structure, and motion cues exhibit large intraclass variability. Selecting inconsecutive and discriminative key poses is therefore a promising way to represent an action. In this section, we answer the questions of what the discriminative key poses are and how to find them.

2.1. Definition of the Key Poses and Model Structure

The structure of the proposed approach is shown in Figure 2. Each action model is composed of a few key poses, and each key pose in the model is represented by three parts: (1) a linear classifier that can discriminate the key pose from the others, (2) the temporal position and offset of the key pose, meaning that the key pose is most likely to appear in the temporal neighborhood centered at that position with the offset as radius, and (3) the weight of the linear classifier score and the weight of the temporal information.

Given a video that contains a sequence of frames, the score of the video is computed by summing, over all key poses of the action model, two weighted terms: the response of the key pose's linear classifier on the frame selected for that key pose and a temporal term that measures how well the temporal position of the selected frame matches the model. The selected frames form the set of key pose positions of the video; an example is illustrated in Figure 3(a). The total number of key poses in the action model ranges from 1 to 20 in the following experiments, and the position of each key pose is its frame index in the video. The temporal term is a Gaussian function of the frame index that reaches its peak when the selected frame lies exactly at the expected temporal position of the key pose, measured relative to the frame at which the action begins. The frame at which the action begins has been manually labeled on the training set; the method of finding it in a testing video is discussed in Section 2.4.
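To make this scoring scheme concrete, the following Python sketch shows one way it could be computed once a model has been learned. The class and names (KeyPose, w_cls, w_time, and so on) are our own illustrative choices rather than the notation of the original equations; the sketch only reproduces the structure described above, a weighted classifier response plus a weighted Gaussian temporal term per key pose.

import numpy as np

class KeyPose:
    """One key pose of an action model (illustrative names, not the authors' code)."""
    def __init__(self, beta, a, sigma, w_cls, w_time):
        self.beta = beta      # linear classifier weights over the frame feature
        self.a = a            # expected temporal offset from the action start
        self.sigma = sigma    # temporal neighborhood radius / width of the Gaussian term
        self.w_cls = w_cls    # weight of the classifier response
        self.w_time = w_time  # weight of the temporal term

def temporal_term(frame_idx, start, key_pose):
    """Gaussian term that peaks when the frame lies at the expected position start + a."""
    d = frame_idx - (start + key_pose.a)
    return np.exp(-(d ** 2) / (2.0 * key_pose.sigma ** 2))

def video_score(frame_features, key_poses, positions, start):
    """Score of a video given the latent key pose positions (one frame index per key pose)."""
    score = 0.0
    for kp, pos in zip(key_poses, positions):
        cls_response = float(np.dot(kp.beta, frame_features[pos]))
        score += kp.w_cls * cls_response + kp.w_time * temporal_term(pos, start, kp)
    return score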

2.2. Feature Extracting and Linear Classifier

With the help of a real-time skeleton estimation algorithm, the 3D joint positions are employed to characterize the motion of the human body. Following [1], we also represent the human movement by the pairwise relative positions of the joints.

For a human skeleton, the joint positions are tracked by the skeleton estimation algorithm, and each joint has 3 coordinates in each frame. The coordinates are normalized based on Procrustes analysis [22], so that the motion is invariant to the initial body orientation and the body size. For a given frame, the feature is the concatenation of all pairwise relative positions of the joints and the normalized joint coordinates themselves. For the MSR Action 3D and UTKinect Action datasets (20 joints), this feature is a 630-dimensional vector (570 pairwise relative position components plus 60 joint position coordinates); for the Florence 3D Action dataset (15 joints), it is a 360-dimensional vector. (The selection among alternative feature representations is discussed in Section 3.) Then, a linear classifier is trained for each key pose on these frame features. The question of which frames should be used for training is discussed in Section 2.3.
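A minimal sketch of this feature construction, assuming the joints of a frame are given as an N x 3 array of normalized 3D coordinates (the function name is ours):

import numpy as np

def pairwise_feature(joints):
    """Frame feature: all pairwise relative joint positions plus the joint coordinates.
    For N = 20 joints this gives C(20, 2) * 3 + 20 * 3 = 570 + 60 = 630 dimensions;
    for N = 15 joints it gives 105 * 3 + 45 = 360 dimensions."""
    n = joints.shape[0]
    diffs = [joints[i] - joints[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate([np.concatenate(diffs), joints.ravel()])

# For example, a 20-joint skeleton yields a 630-dimensional feature vector:
assert pairwise_feature(np.zeros((20, 3))).shape == (630,)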

2.3. Latent Key Poses Mining

It is not easy to decide which frames contain the key poses, because the space of possible key poses is too large to enumerate. Enlightened by [23], since the key pose positions are not observable in the training data, we formulate the learning problem as a latent structural SVM, regarding the key pose positions as the latent variables.

The scoring function above can be rewritten so that the set of key pose positions is treated as the latent variable. Given a labeled training set of videos with positive and negative action labels, the objective is to minimize the standard latent SVM objective, that is, the regularized hinge loss over the training videos, where the penalty parameter controls the tradeoff between the regularization term and the training error. Following [23], the model is first initialized from the positive and negative subsets of the training set, with the key frames selected as shown in Algorithm 1. In Algorithm 1, pos_pose and neg_pose are the positive frame set and the negative frame set, respectively, and are used to train the linear classifier of each key pose. In order to initialize our model, we first compute the feature of the candidate frame (the pos-th frame) of the first video sample in the positive subset. Then, for every other positive sample, the Euclidean distance between this feature and the features of the frames lying in the temporal neighborhood of pos is computed, and the frame with the minimum Euclidean distance in each sample is added to pos_pose. Then pos_pose is used to train the linear classifier, and the temporal position of the key pose is chosen as the average frame number of the frames in pos_pose. To select the next key pose, the frame with the minimum score under the current classifier is chosen for the next loop; in other words, the frame that is most different from the previous key pose is selected. Finally, all classifiers and weights are trained with the linear SVM when Algorithm 1 is completed.

Require: positive video set P; negative video set N; number of key poses K; neighborhood radius r
pos = index of the initial candidate frame in the first video of P
for i = 1 to K do
   pos_pose = {f_pos}, where f_pos is the pos-th frame of the first video in P
   for each remaining video V in P do
      add to pos_pose the frame of V whose feature has the minimum Euclidean distance
      to the feature of f_pos among the frames with indices in [pos - r, pos + r]
   end for
   neg_pose = frames sampled from the videos in N
   Train the linear classifier of key pose i with pos_pose and neg_pose
   Set the temporal position of key pose i to the average frame number in pos_pose
   for each video V in P do
      score every frame of V with the classifier of key pose i
   end for
   pos = index of the frame of the first video in P with the minimum score under the classifier of key pose i
end for
Train the classifier weights and the temporal weights with a linear SVM

Once the initialization is finished, the model is iteratively trained as follows. First, for each positive video example, the optimal key pose positions (the latent variables) are found, each constrained to lie in the temporal neighborhood of the corresponding key pose; the temporal position of every key pose is then updated with the average of these optimal positions, and a new linear classifier is trained for each key pose from the frames selected at the updated positions. Second, the latent SVM objective is optimized over the model weights with stochastic gradient descent. Thus, the model is modified to better capture the skeleton characteristics of each action.
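The latent-variable step of this alternation can be sketched as follows, reusing the KeyPose and temporal_term definitions from the sketch in Section 2.1. It only illustrates how the best frame for each key pose could be searched within its temporal neighborhood; it is not the authors' implementation.

import numpy as np

def infer_key_pose_positions(frame_features, key_poses, start):
    """For a positive video whose action starts at `start`, pick for every key pose the
    frame inside its temporal neighborhood that maximizes the combined classifier
    response and temporal term (the latent-variable step of the training loop)."""
    n_frames = len(frame_features)
    positions = []
    for kp in key_poses:
        center = start + kp.a
        lo = int(max(0, min(center - kp.sigma, n_frames - 1)))
        hi = int(max(lo + 1, min(n_frames, center + kp.sigma + 1)))
        best = max(range(lo, hi),
                   key=lambda t: kp.w_cls * float(np.dot(kp.beta, frame_features[t]))
                                 + kp.w_time * temporal_term(t, start, kp))
        positions.append(best)
    return positions

The positions returned this way would then be used to update the temporal position of each key pose and to retrain its linear classifier, before the weights are updated by stochastic gradient descent.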

2.4. Action Recognition with Key Poses

The key technical issue in action recognition in real-world video is that we do not know where the action starts, and searching over all possible start positions takes a lot of time. Fortunately, the score of each possible start position can be computed independently, so a parallel tool such as OpenMP or CUDA can be helpful.

Given a test video, the skeleton feature of each frame and its score under each key pose classifier are first computed in advance so that they can be reused later. Then, for each possible action start position, we compute the score of each key pose as the weighted sum of the classifier response of the best-matching frame in the key pose's temporal neighborhood and the corresponding temporal term. These scores are summed together as the final score of the start position. If the final score is larger than a threshold, an action beginning at that position has been detected and recognized. Figure 3 shows key poses for different actions in the Florence 3D Action dataset.
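A sketch of this test procedure, again reusing the KeyPose and temporal_term definitions from Section 2.1; the exhaustive scan over start positions and the thresholding follow the description above, while the function and variable names are ours.

import numpy as np

def detect_action(frame_features, key_poses, threshold):
    """Scan every candidate start position and report those whose summed key pose
    scores exceed the threshold."""
    n_frames = len(frame_features)
    # Precompute the response of every frame under every key pose classifier (reused below).
    responses = np.array([[float(np.dot(kp.beta, f)) for f in frame_features]
                          for kp in key_poses])              # shape: (K, n_frames)
    detections = []
    for start in range(n_frames):
        total = 0.0
        for k, kp in enumerate(key_poses):
            center = start + kp.a
            lo = int(max(0, min(center - kp.sigma, n_frames - 1)))
            hi = int(max(lo + 1, min(n_frames, center + kp.sigma + 1)))
            total += max(kp.w_cls * responses[k, t] + kp.w_time * temporal_term(t, start, kp)
                         for t in range(lo, hi))
        if total > threshold:
            detections.append((start, total))
    return detections

Because each start position is scored independently, the loop over start positions can be distributed across threads or GPU blocks, which is what the remark about OpenMP and CUDA above refers to.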

3. Experimental Results

This section presents all experimental results. First, in order to eliminate the noise generated by translation, scale, and rotation changes of skeleton poses, we preprocess the dataset with Procrustes analysis [22], and we conduct action recognition experiments with and without Procrustes analysis on the UTKinect dataset to demonstrate its effectiveness. Second, the appropriate feature extraction method is selected from four existing methods according to experimental results on the Florence 3D Action dataset. Third, a quantitative experiment is conducted to select the number of inconsecutive key poses. Last, we evaluate our model and compare it with state-of-the-art methods on three benchmark datasets: the MSR Action 3D dataset, the UTKinect Action dataset, and the Florence 3D Action dataset.

3.1. Datasets

(1) Florence 3D Action Dataset. The Florence 3D Action dataset [26] was collected at the University of Florence during 2012 and captured using a Kinect camera. It includes 9 activities, and 10 subjects were asked to perform each action two or three times, resulting in a total of 215 activity samples. Each frame contains 15 skeleton joints.

(2) MSR Action 3D Dataset. The MSR Action 3D dataset [11] consists of skeleton data obtained by a depth sensor similar to the Microsoft Kinect. The data was captured at a frame rate of 15 frames per second. Each action was performed by 10 subjects in an unconstrained way two or three times. The set of actions includes high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two-hand wave, side boxing, forward kick, side kick, jogging, tennis swing, and tennis serve.

(3) UTKinect Action Dataset. The UTKinect Action dataset [24] was captured using a single stationary Kinect and contains 10 actions. Each action is performed twice by 10 subjects in an indoor setting. Three synchronized channels (RGB, depth, and skeleton) are recorded at a frame rate of 30 frames per second. The 10 actions are walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands, and clap hands. It is a challenging dataset due to large viewpoint variations and high intraclass variation. Therefore, this dataset is used to validate the effectiveness of Procrustes analysis [22].

3.2. Data Preprocessing with Procrustes Analysis

Skeleton data in each frame of a given video usually consist of a fixed number of predefined joints. The position of each joint is determined by its three coordinates. Figure 4 shows the skeleton definition in the MSR Action 3D dataset; it contains 20 joints, each represented by its coordinates. Using the raw human skeleton in the video as the feature is not a good choice, because the skeleton varies with rotation, scaling, and translation. So, before the experiments, we normalize the datasets by Procrustes analysis.

In statistics, Procrustes analysis is a form of statistical shape analysis used to analyze the distribution of a set of shapes, and it is widely applied in computer vision, for example, in face detection. In this paper, it is used to align the skeleton joints and eliminate the noise due to rotation, scaling, and translation. The details of Procrustes analysis are described next.

Given a skeleton with a set of joints, the first step is to remove translation. We compute the mean coordinate of all joints and move it to the origin of the coordinate system; the translation is completed by subtracting the mean coordinate from each joint coordinate. The purpose of scaling is to make the root mean square of all joint coordinates equal to 1: we compute the root mean square of the distances of the joints from the origin and divide every joint coordinate by this value. The rotation of the skeleton is handled in the last step of Procrustes analysis. Removing the rotation is more complex, as a standard reference orientation is not always available. We therefore use a group of standard skeleton joint points that represent an action facing the positive direction of the x-coordinate axis, whose mean coordinate lies at the origin and whose root mean square is 1. Then we compute the rotation matrix for the skeleton that has been translated and scaled as described above: a 3 x 3 matrix is formed from the skeleton and the reference joints and decomposed by singular value decomposition into two orthogonal matrices and a diagonal matrix, and the rotation matrix is obtained as the product of one orthogonal factor and the transpose of the other (the standard solution of the orthogonal Procrustes problem). At last, the skeleton joint points are aligned with the reference by multiplying them by the rotation matrix.
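The following sketch shows one standard way to carry out these three steps, assuming the joints of a skeleton are stored as an N x 3 array with one row per joint and that the reference skeleton has already been centered and scaled. Since the exact matrix convention of the original equations is not recoverable from the text, the rotation step follows the usual SVD solution of the orthogonal Procrustes problem.

import numpy as np

def procrustes_align(skeleton, reference):
    """Align a skeleton (N x 3 joint coordinates) to a reference skeleton by removing
    translation, scale, and rotation (orthogonal Procrustes analysis). The reference is
    assumed to be already centered at the origin and scaled to unit root mean square."""
    # Translation: subtract the mean joint coordinate so the centroid is at the origin.
    x = skeleton - skeleton.mean(axis=0)
    # Scaling: divide by the root mean square of the joint coordinates.
    x = x / np.sqrt((x ** 2).sum() / len(x))
    # Rotation: solve the orthogonal Procrustes problem with a singular value decomposition.
    m = reference.T @ x                        # 3 x 3 matrix relating the two point sets
    u, s, vt = np.linalg.svd(m)
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0   # keep a proper rotation (no reflection)
    r = (u * np.array([1.0, 1.0, d])) @ vt
    return x @ r.T                             # skeleton aligned to the reference

In practice, every frame of every video would be aligned to the same reference skeleton in this way before the features of Section 2.2 are extracted.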

We followed the cross-subject test setting of [30] on the UTKinect dataset to test the validity of Procrustes analysis. The results are shown in Table 1. It is easy to see that the recognition rate of almost all actions is improved after preprocessing the skeleton joint points with Procrustes analysis, and the recognition rate of some individual actions is improved considerably. This shows that the translation, scaling, and rotation of the human skeleton in the video affect the recognition accuracy and that Procrustes analysis is an effective way to eliminate the influence of these geometric transformations.

3.3. Feature Extraction Method Selection

With the growth of research on skeleton-based action recognition, many efficient feature representations have been proposed. We select four of them as alternative feature representations: Pairwise [1], the most informative sequences of joint angles (MIJA) [31], histograms of 3D joints (HOJ3D) [24], and the sequence of the most informative joints (SMIJ) [13].

Given a skeleton consisting of a set of joints, the Pairwise representation is computed as follows: for each pair of joints, we extract the pairwise relative position feature by taking the difference between the position of one joint and the position of the other, and the feature of the skeleton is the concatenation of these differences. Because the original joint positions are themselves informative, we improve this representation by also concatenating the joint coordinates, so the new feature consists of all pairwise relative positions together with all joint positions.

The most informative sequences of joint angles (MIJA) representation uses joint angles as features; the shape of the joint trajectories encodes local motion patterns for each action. It uses 11 of the 20 joints that capture the information of an action and centers the skeleton using the hip center joint as the origin of the coordinate system. From this origin, vectors to the 3D position of each joint are calculated. For each vector, it computes the angle between the vector's projection onto the x-z plane and the positive x-axis, and the angle between the vector and the y-axis. The feature consists of the 2 angles of each joint.
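A small sketch of such angle features, under the assumption (ours, since the axis names are garbled in the text) that the first angle is measured between the vector's x-z projection and the positive x-axis and the second between the vector and the y-axis; the joint indices in the usage comment are purely illustrative.

import numpy as np

def joint_angle_feature(joints, hip_center_idx, selected_idx):
    """MIJA-style angles, sketched from the description above rather than the original code.
    joints: (N, 3) array of 3D joint positions."""
    origin = joints[hip_center_idx]
    feats = []
    for j in selected_idx:
        v = joints[j] - origin
        azimuth = np.arctan2(v[2], v[0])                          # x-z projection vs. +x axis
        elevation = np.arccos(v[1] / (np.linalg.norm(v) + 1e-8))  # angle with the y axis
        feats.extend([azimuth, elevation])
    return np.array(feats)

# e.g., angles of 11 hypothetical joints relative to a hypothetical hip center joint 0:
# feats = joint_angle_feature(joints, hip_center_idx=0, selected_idx=range(1, 12))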

The histograms of 3D joints (HOJ3D) representation chooses 12 discriminative joints out of the 20 skeletal joints. It takes the hip center as the center of the reference coordinate system and defines the x-direction according to the left and right hip joints. The remaining 8 joints are used to compute a 3D spatial histogram: the spherical coordinate space is partitioned into 84 bins, and each joint location votes into the 3D bins with a Gaussian weight function. By counting the votes in each bin and concatenating them, an 84-dimensional feature vector is obtained.

The sequence of the most informative joints (SMIJ) representation also takes joint angles as features, but it differs from MIJA. It partitions the joint angle time series of an action sequence into a number of congruent temporal segments and computes the variance of the joint angle time series of each joint over each temporal segment. The top 6 most variable joints in each temporal segment are selected, and features are extracted with a mapping function that maps a time series of scalar values to a single scalar value.

In order to find the optimal feature, we conduct an experiment on the Florence 3D Action dataset, in which each video is short. We estimate 5 additional joint coordinates from the original 15 joints of each frame in the Florence dataset so that each frame has the same number of joints as in the MSR Action 3D and UTKinect datasets. The experiment uses a cross-subject test setting: one half of the dataset is used to train the key pose model and the other half is used for testing. The model has 4 key poses, and Procrustes analysis is performed before feature extraction. The results are shown in Figure 5. The overall accuracy of the Pairwise feature across all actions is better than that of SMIJ and MIJA, and it is observed that, for all actions except sit down and stand up, the Pairwise representation shows promising results. So, in the following experiments, we select the Pairwise feature for action recognition. The estimated joint coordinates introduce additional noise, so the accuracy here is lower than the results on the original Florence 3D Action dataset (shown in Table 6).

3.4. Selection of Key Pose Numbers

In this section, we conduct experiments to determine how many key poses are necessary for action recognition. The experimental results are shown in Figure 6; the horizontal axis denotes the number of key poses, and the vertical axis denotes the recognition accuracy of the proposed approach. The number of key poses ranges from 1 to 20. We can see that the accuracy increases with the number of key poses when the number is less than 4, almost reaches its maximum when the number of key poses equals 4, and does not increase further when the number of key poses is more than 4. Considering both accuracy and computation time, 4 is selected as the number of key poses for action recognition in the following experiments.

Table 2 enumerates the recognition accuracy for each action in the UTKinect Action dataset when the number of key poses ranges from 4 to 8. It can be seen that the recognition accuracy of an individual action varies with the number of key poses. However, the average recognition accuracy is nearly the same for different numbers of key poses, so 4 is the most cost-effective choice.

3.5. Results on MSR Action 3D Dataset

According to the standard protocol provided by Li et al. [11], the dataset was divided into three subsets, shown in Table 3. AS1 and AS2 were intended to group actions with similar movement, while AS3 was intended to group complex actions together. For example, some actions in AS1 are easily confused with each other because of their similar movement, and the action pickup & throw in AS3 is a composition of bend and high throw in AS1.

We evaluate our method using a cross-subject test setting: videos of 5 subjects were used to train our model, and videos of the other 5 subjects were used for testing. Table 4 reports the results for AS1, AS2, and AS3. We compare our performance with Li et al. [11], Xia et al. [24], and Yang and Tian [25]. Our algorithm achieves a considerably higher recognition rate than Li et al. [11] in all the testing setups on AS1, AS2, and AS3. For AS2, the accuracy of the proposed method is the highest. For AS1 and AS3, our recognition rate is only slightly lower than that of Xia et al. [24] and Yang and Tian [25], respectively. However, the average accuracy of our method over the three subsets is higher than that of the other methods. Table 5 shows the results on the whole MSR Action 3D dataset. The average accuracy of the proposed method reaches 90.94%, and it is easy to see that our method performs better than the other six methods.

3.6. Results on UTKinect Action Dataset

On the UTKinect dataset, we follow the cross-subject test setting of [30], in which one half of the subjects is used for training our model and the other half is used to evaluate it. We compare our model with Xia et al. [24] and Gan and Chen [30]. Figure 7 summarizes the results of our model along with the competing approaches on the UTKinect dataset. Our method achieves the best performance on three actions, namely, pull, push, and throw. More importantly, the average accuracy of our method reaches 91.5%, better than the other two methods (90.9% for Xia et al. [24] and 91.1% for Gan and Chen [30]). The accuracy on actions such as clap hands and wave hands is not as good; the reason may be that the range of skeleton joint movement in these actions is small and the skeleton data contain more noise, which hinders our method from finding the optimal key poses and degrades the accuracy.

3.7. Result on Florence 3D Actions Dataset

We follow the leave-one-actor-out protocol suggested by the dataset collectors on the original Florence 3D Action dataset: all the sequences from 9 of the 10 subjects are used for training, while the remaining one is used for testing; the procedure is repeated for each subject, and the 10 classification accuracy values are averaged at the end. For comparison with other methods, the average action recognition accuracy is also computed. The experimental results are shown in Table 6, where each column reports each action's recognition accuracy when the corresponding subject is used for testing. The challenges of this dataset are the human-object interaction and the different ways of performing the same action. Analyzing the experimental results, we notice that the proposed approach obtains high accuracies for most of the actions and overcomes the difficulty of intraclass variation for actions such as bow and clap. The proposed approach obtains lower accuracies for actions such as answer the phone and read watch; this can be explained by the fact that these actions involve human-object interaction with a small range of motion, which the Pairwise feature cannot reflect well. Furthermore, a comparison with other methods is listed in Table 7. Our average accuracy is better than that of Seidenari et al. [15] and the same as that of Devanne et al. [16].

4. Conclusion

In this paper, we presented an approach for skeleton-based action recognition that mines key skeleton poses with a latent SVM. Experimental results demonstrated that human actions can be recognized from only a few frames containing key skeleton poses; in other words, a few inconsecutive and representative skeleton poses can describe the action in a video. Starting from feature extraction using the pairwise relative positions of the joints, the positions of the key poses are found with the help of the latent SVM, and the model is iteratively trained with positive and negative video examples. In the test procedure, a simple method recognizes the action by computing the score of each possible start position.

We validated our model on three benchmark datasets: the MSR Action 3D dataset, the UTKinect Action dataset, and the Florence 3D Action dataset. Experimental results demonstrated that our method matches or outperforms the compared methods. Because our method relies on descriptors built from simple relative positions of the joints, its performance degrades when the actions involve little movement and are uninformative, for instance, actions performed only by forearm gestures such as clap hands in the UTKinect Action dataset. In the future, we will explore other local features that reflect subtle motion for a better understanding of human actions.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.