Abstract

The human skeleton can be considered as a tree system of rigid bodies connected by bone joints. In recent research, substantial progress has been made in both the theory and practice of skeleton-based action recognition. However, it remains challenging to accurately represent the skeleton and to precisely eliminate noisy skeletons from an action sequence. This paper proposes a novel skeletal representation, composed of two subfeatures, to recognize human action: static features and dynamic features. First, to avoid scale variations from subject to subject, the orientations of the rigid bodies in a skeleton are employed to capture the scale-invariant spatial information of the skeleton. The static feature of the skeleton is defined as a combination of these orientations. Unlike previous orientation-based representations, the orientation of a rigid body in the skeleton is defined by the rotations between the rigid body and the coordinate axes in three-dimensional space. Each rotation is mapped to the special orthogonal group SO(3). Next, the rigid-body motions between the skeleton and its previous skeletons are utilized to capture the temporal information of the skeleton. The dynamic feature of the skeleton is defined as a combination of these motions. Similarly, the motions are represented as points in the special Euclidean group SE(3). Therefore, the proposed skeleton representation lies in a Lie group that is a direct product of SO(3) and SE(3) copies, which is a manifold. Using the proposed representation, an action can be considered as a series of points in this Lie group. Then, to recognize human action more accurately, a new pattern-growth algorithm named MinP-PrefixSpan is proposed to mine the key-skeleton-patterns from the training dataset. Because the algorithm reduces the number of new patterns in each growth step, it is more efficient than the PrefixSpan algorithm. Finally, the key-skeleton-patterns are used to discover the most informative skeleton sequence of each action.
Our approach achieves accuracies of 94.70%, 98.87%, and 95.01% on three action datasets, outperforming other related action recognition approaches, including LieNet, Lie group, Grassmann manifold, and graph-based models.

1. Introduction

Human action recognition is currently one of the most active research topics in computer vision, owing to its applications in intelligent surveillance, video games, robotics, and other fields. Several approaches have been proposed to recognize human action from RGB video sequences over the past few decades [1], but their performance is unsatisfactory because RGB data are very sensitive to factors such as perspective changes, occlusions, and background clutter. Although significant research results have been achieved, human action recognition remains a challenging problem.

Because the human skeleton can generally be regarded as an articulated system of rigid segments connected by joints, human action can be viewed as a continuous evolution of the spatial configuration constructed by these rigid segments [2]. Therefore, if human skeleton sequences can be accurately extracted from RGB videos, action recognition can be performed by classifying these sequences. However, it is very difficult to accurately extract a skeleton sequence from RGB videos [3]. With the advent of cost-effective RGB-D cameras, it has become easier to extract the three-dimensional (3D) human skeleton from depth maps. Although this mitigates appearance and viewpoint variations to a certain extent [4–7], the following two challenges cause large intraclass variations and remain unresolved. First, different people can perform the same action in different ways. Second, the 3D human skeleton is sometimes imprecise because depth maps include noisy information. However, psychological research has found that humans can easily recognize an action from a pose sequence [8]. Building on [8], Yang et al. argued that actions can be classified by a single key pose [9]. This suggests that a set of key skeletons, rather than the entire skeleton sequence, can be used to perform action classification. Since the representation of the key poses is robust to outlier poses, this approach should improve the accuracy of action recognition as long as the key poses are accurate.

The general framework of the proposed approach is shown in Figure 1. Observing human action in daily life, the orientations and motions of rigid bodies carry a great deal of useful information for action recognition. In this paper, a new skeletal representation, composed of a static feature and a dynamic feature, is proposed for 3D skeleton-based action recognition. The static feature represents the spatial information of a given skeleton t. To capture scale-invariant spatial information, the orientations of the rigid bodies in the skeleton are employed to construct its static feature. In this work, the orientation of a rigid body in the skeleton is represented as six rotation matrices between the rigid body and the three coordinate axes in 3D space. The rotation matrices are mapped to the special orthogonal group SO(3) [10]. Next, the dynamic feature represents the temporal information of skeleton t. The dynamic feature is composed of the rigid-body motions between skeletons t and t-1 and those between skeletons t and 1 (the three skeletons belong to the same sequence or action). The motions are represented as points in the special Euclidean group SE(3) [11]. Hence, skeleton t is represented by a point in the Lie group formed by the direct product of these SO(3) and SE(3) components. Using the proposed skeleton representation, a human action (skeleton sequence) can be represented as points in this Lie group. However, it is typically very complicated to classify human actions represented in a Lie group directly. Many standard classification approaches, such as the support vector machine (SVM) [12], are not directly applicable to Lie groups. To overcome these classification difficulties, the actions (skeleton sequences) are mapped from the Lie group to its Lie algebra, the tangent space of the manifold at the identity element.
The Lie algebra is a vector space, which makes action classification easier.

An action (skeleton sequence) usually includes many noisy skeletons, which can reduce action recognition accuracy. In this study, the key-skeleton-patterns are used to eliminate noisy skeletons from an action, and the remaining skeletons are called the most informative skeleton sequence. First, a pattern is defined as a short skeleton sequence whose skeletons are not necessarily adjacent in the original sequences. If the short skeleton sequence appears in many skeleton sequences of an action class, the pattern is called a key-skeleton-pattern of that class. Next, to mine the key-skeleton-patterns, k-means is used to learn a symbolic dictionary from all skeletons in the dataset. Each symbol in the dictionary represents a class of similar skeletons, which means that each skeleton is quantized (represented) by a symbol in the dictionary. Then, a skeleton sequence can be represented as a symbol sequence. In this paper, a probability is used to measure the distance between a skeleton and its corresponding symbol in order to minimize the effect of quantization errors (e.g., two different skeletons being quantized to the same symbol). Hence, each skeleton is represented by a distance-based probability, and an action is represented as a probability sequence. Then, a new pattern-growth algorithm named MinP-PrefixSpan is proposed to mine the key-skeleton-patterns of an action class from the symbol sequences and the probability sequences that correspond to that class. Compared with the PrefixSpan algorithm, our algorithm achieves higher efficiency by reducing the number of new skeleton patterns in each growth step. Finally, the key-skeleton-patterns are utilized to eliminate noisy skeletons from an action in order to capture its most informative skeleton sequence. An SVM is employed to classify the most informative skeleton sequences.

The main contributions of this study are as follows. (1) To capture scale-invariant skeletal information, the orientations of rigid bodies in a skeleton are utilized to construct the static feature. Unlike previous orientation-based approaches, in this study a rigid-body orientation is represented as six rotation matrices, and each rotation matrix is represented as a point in SO(3). (2) Traditional approaches based on Lie groups [5, 13, 14] only consider the spatial information of a skeleton but ignore the temporal information between different skeletons. Therefore, our approach employs the rigid-body motions between different skeletons to describe the temporal variation. Likewise, the motions can be represented as points in SE(3). (3) Traditional approaches also ignore the influence of noisy skeletons on the accuracy of action recognition. In this study, based on the PrefixSpan algorithm [15] from data mining, a new pattern-growth algorithm is proposed to mine the key-skeleton-patterns of each action class, and these patterns are used to eliminate noisy skeletons.

2. Related Work

A brief overview of the related work on skeleton-based human action recognition is provided in this section, and various sequential pattern-mining algorithms are reviewed.

The existing skeleton-based action recognition approaches can be classified into three main categories. The first class of approaches ignores the influence of noisy skeletons on the accuracy of action recognition. Slama et al. represented an action by an observability matrix, characterized as an element of a finite Grassmann manifold [16]. However, their method does not eliminate noisy skeletons from an action, and a finite Grassmann manifold is insufficient to approximate an extended observability sequence. Ding et al. divided actions into subactions and used the profile hidden Markov model (HMM) to align them [13]. Although their approach accurately extracts the spatial features of an action, it does not solve the following two problems: eliminating noisy skeletons and reducing the time complexity of the profile HMMs. Liu et al. proposed a new spatiotemporal representation, called "Skepxels," to transform skeleton videos into images of flexible dimensions, and employed the resulting images to build a CNN-based framework for effective human action recognition [17]. Likewise, their approach does not eliminate noisy skeletons from an action. In this study, the key-skeleton-patterns of an action are utilized to eliminate noisy skeletons from the action in order to improve the accuracy of action recognition.

The second class of approaches ignores scale variations from subject to subject, which means that the spatial feature of an action cannot be accurately represented. Chaudhry et al. hierarchically divided the human skeleton into smaller parts and employed certain bio-inspired shape features to represent each part [18]. The temporal evolutions of these bio-inspired features are modeled by linear dynamical systems (LDSs). Although their approach takes full advantage of the correlations between skeletal parts, it ignores the features of the rigid bodies in a skeleton and the scale variations between subjects. Xia et al. proposed a view-invariant representation of the human skeleton using histograms of 3D joint locations [19]. The temporal evolutions of this skeletal representation are modeled by a discrete HMM. However, their approach ignores not only the relative geometry between the rigid bodies in a skeleton but also the normalization of the skeleton data. Li et al. represented an action by a special graph based on the top-K relative variance of joint relative distance (RVJRD) [20]. One potential limitation of this approach is that the graph-based model does not handle scale variations, which may cause incorrect spatial information to be selected by the top-K RVJRD. In contrast, our proposed approach uses the orientations of the rigid bodies in a skeleton to capture scale-invariant skeletal features.

The third class of approaches ignores the temporal information of an action and treats the poses in the action independently. Evangelidis et al. used a local skeleton descriptor to encode the relative positions of joint quadruples [21]. The descriptor of an action was represented by a multilevel Fisher vector composed of the local skeleton descriptors in the action. However, the action descriptor not only ignores the temporal information between different skeletons but also has high time complexity. Huang et al. combined the Lie group structure with a deep network framework [22]. Their learning structure (LieNet) has a rotation mapping layer that transforms the Lie group features into a traditional neural network model. One main limitation of this approach is that LieNet ignores the rich temporal information of human actions. Vemulapalli et al. described the relative geometry between the rigid-body parts using the special Euclidean group SE(3) [5]. Therefore, the entire skeleton in an action can be represented as a point in SE(3) × ⋯ × SE(3), and an action is represented as a curve in this Lie group. Although their approach can accurately extract the spatial information of a skeleton, it ignores the temporal cues between the skeletons in an action and does not eliminate noisy skeletons from the action. Our proposed dynamic feature models the temporal structure of an action using the rigid-body motions between different skeletons in the action.

Sequential pattern mining aims to discover frequent subsequences as patterns in a sequence database. Traditional sequential pattern-mining algorithms [23–26] are usually used to mine frequent sequential patterns from deterministic databases. However, these approaches cannot be directly applied to uncertain (probabilistic) data. Unfortunately, the existing pattern-mining algorithms for uncertain datasets [27, 28] are not suited to our probabilistic sequence model. Therefore, considering the amount of noise in our uncertain (probabilistic) datasets, a new pattern-growth algorithm is proposed to mine the key-skeleton-patterns from the datasets.

3. Proposed Framework

3.1. Fundamental Concepts

In this subsection, a brief overview of the special Euclidean group SE(3) and the special orthogonal group SO(3) is presented, which is necessary for a further understanding of the Lie group representation. We refer the reader to [2, 10, 11] for a general introduction to Lie groups. Important notations are shown in Table 1.

3.1.1. Special Orthogonal Group

The special orthogonal group SO(3) is a Lie group that can be represented by the set of all 3×3 orthogonal matrices with determinant one:

SO(3) = {A ∈ R^(3×3) : A A^T = I_3, det(A) = 1},

where I_3 denotes the 3×3 identity matrix and A is a rotation matrix. In 3D space, a rotation A is an element of SO(3) and transforms a vector x ∈ R^3 to y ∈ R^3 by y = Ax.

Every Lie group SO(3) has an associated Lie algebra so(3), which is the tangent space around the identity element I_3. The Lie algebra of SO(3), denoted by so(3), is the set of all 3×3 real skew-symmetric matrices:

so(3) = {W ∈ R^(3×3) : W^T = -W}.

Given an element

W = [0, -w3, w2; w3, 0, -w1; -w2, w1, 0],

its vector form is w = (w1, w2, w3)^T. The exponential map from so(3) to SO(3) and the logarithm map from SO(3) to so(3) are, respectively,

exp(W) = I_3 + (sin θ / θ) W + ((1 - cos θ) / θ^2) W^2, with θ = ||w||,
log(A) = (θ / (2 sin θ)) (A - A^T), with θ = arccos((tr(A) - 1) / 2).
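As an illustration of these maps, the following NumPy sketch implements the vector/matrix isomorphism and the closed-form exponential (Rodrigues' formula) and logarithm, valid for rotation angles below π. The function names (`hat`, `vee`, `exp_so3`, `log_so3`) are ours, not from the paper.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix in so(3)."""
    wx, wy, wz = w
    return np.array([[0.0, -wz, wy],
                     [wz, 0.0, -wx],
                     [-wy, wx, 0.0]])

def vee(W):
    """Inverse of hat: recover the 3-vector from a skew-symmetric matrix."""
    return np.array([W[2, 1], W[0, 2], W[1, 0]])

def exp_so3(w):
    """Exponential map so(3) -> SO(3) via Rodrigues' formula."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    W = hat(w / theta)  # unit-axis skew matrix
    return np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * (W @ W)

def log_so3(R):
    """Logarithm map SO(3) -> so(3), returned in vector form (theta < pi)."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros(3)
    return vee(theta / (2 * np.sin(theta)) * (R - R.T))
```

A round trip `log_so3(exp_so3(w))` recovers `w` for any rotation vector of magnitude below π.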

3.1.2. Special Euclidean Group

The special Euclidean group SE(3) is a Lie group given by the set of 4×4 matrices

SE(3) = { [A, t; 0, 1] : A ∈ SO(3), t ∈ R^3 }.

The matrix representation also makes SE(3) act on points x ∈ R^3 by rotating and translating them: y = Ax + t.

Every Lie group SE(3) can be associated with a Lie algebra se(3), which is the tangent space of SE(3) at the identity matrix I_4. Note that se(3) is a 6D vector space that can be formed by 4×4 matrices of the form

[W, v; 0, 0],

where W is a 3×3 skew-symmetric matrix and v ∈ R^3. Given such an element, its vector form is (w1, w2, w3, v1, v2, v3)^T. The exponential map from se(3) to SE(3) and the logarithm map from SE(3) to se(3) are the matrix exponential and the matrix logarithm, respectively.
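The SE(3) machinery above can be illustrated with a short NumPy sketch. The helper names (`se3_matrix`, `apply_se3`, `relative_motion`) are ours; `relative_motion` shows one common way to form the rigid-body motion carrying one pose onto another, used later for inter-skeleton motions.

```python
import numpy as np

def se3_matrix(R, t):
    """Pack rotation R (3x3) and translation t (3,) into a 4x4 SE(3) element."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def apply_se3(T, x):
    """Rotate and translate a 3D point x by the rigid-body motion T: y = Rx + t."""
    return T[:3, :3] @ x + T[:3, 3]

def relative_motion(T_a, T_b):
    """The rigid-body motion that carries pose a onto pose b: T_b @ inv(T_a)."""
    return T_b @ np.linalg.inv(T_a)
```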

Direct products of SO(3) and SE(3): the direct product is used to combine multiple copies of SO(3), which form a new Lie group SO(3) × ⋯ × SO(3) with identity element (I_3, …, I_3) and Lie algebra so(3) ⊕ ⋯ ⊕ so(3). The exponential and logarithm maps of this product group act componentwise on the SO(3) factors, and the vector form of an element of its Lie algebra is the concatenation of the component vector forms. Similarly, a new Lie group SE(3) × ⋯ × SE(3) with identity element (I_4, …, I_4) and Lie algebra se(3) ⊕ ⋯ ⊕ se(3) is formed by the direct product; its exponential and logarithm maps likewise act componentwise, and the vector form is the concatenation of the component vector forms.

3.1.3. Explanation of Fundamental Concepts

According to the concepts described in Section 3.2.1, the orientation of a rigid body is represented by six rotation matrices. Mathematically, a rotation matrix is a point in SO(3); therefore, the orientation of a rigid body in a skeleton can be represented as six points in SO(3). The static feature, composed of the orientations of all rigid bodies in the skeleton, is then represented as a point in the Lie group SO(3) × ⋯ × SO(3), as shown in Figure 1.

The motion of a rigid body is generally regarded as its rotations and translations in 3D space. Mathematically, the rotations and translations of a rigid body are elements of SE(3); therefore, a rigid-body motion between skeletons t and t-1 (or skeletons t and 1) can be represented as a point in SE(3). The dynamic feature, which is composed of the rigid-body motions between skeletons t and t-1 and those between skeletons t and 1, is then represented as a point in the Lie group SE(3) × ⋯ × SE(3), as shown in Figure 1. The skeletal representation, composed of the static feature and the dynamic feature, can thus be represented as a point in the direct product of these two Lie groups.

The wavy surface in Figure 1 represents a Lie group. A whole circle on the wavy surface represents an action (skeleton sequence), and each black dot in the circle represents a skeleton. An action can then be represented as points in the Lie group (the points belonging to the same circle). To overcome the classification difficulties, an action (a whole circle) is mapped from the Lie group to its Lie algebra, as shown in Figure 1. The Lie algebra is a vector space.

3.2. Extraction of Skeleton Features

In this subsection, the static and dynamic features of a skeleton are represented as a point in the Lie group. A skeleton consists of a set of bone joints and a set of rigid-body parts. Figure 2(a) shows an example of the human skeleton with 19 rigid-body parts and 20 bone joints.

3.2.1. Static Feature of Skeleton

By observing human action, the orientations of the rigid bodies in a skeleton (pose) can carry a great deal of valuable information for action recognition. To describe the orientation of a given rigid body, the global coordinate system is translated to the local coordinate system. Three rotations transform the rigid body to the three coordinate axes, as shown in Figure 3(c). Similarly, three further rotations transform the three coordinate axes back to the rigid body, respectively, as shown in Figure 3(d). These six rotations describe the orientation of the rigid body.

Given a skeleton, the orientation of each rigid body is represented by its six rotations. In this work, the static feature of the skeleton is defined as the set of orientations of all rigid bodies in the skeleton, where M is the total number of rigid bodies in the human skeleton.
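Since the exact construction of the six rotations is not spelled out above, the sketch below illustrates one plausible reading: for each bone, compute the minimal rotation carrying the bone direction onto each coordinate axis, plus the three inverse rotations. All function names are ours, and this is a reading of the method, not the paper's implementation.

```python
import numpy as np

def rotation_between(u, v):
    """Minimal rotation matrix carrying direction u onto direction v."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    c = np.dot(u, v)
    w = np.cross(u, v)
    s = np.linalg.norm(w)
    if s < 1e-12:
        if c > 0:
            return np.eye(3)                  # already aligned
        # anti-parallel: rotate pi about any axis orthogonal to u
        n = np.cross(u, [1.0, 0.0, 0.0])
        if np.linalg.norm(n) < 1e-6:
            n = np.cross(u, [0.0, 1.0, 0.0])
        n = n / np.linalg.norm(n)
        return 2 * np.outer(n, n) - np.eye(3)
    W = np.array([[0, -w[2], w[1]],
                  [w[2], 0, -w[0]],
                  [-w[1], w[0], 0]]) / s      # unit-axis skew matrix
    return np.eye(3) + s * W + (1 - c) * (W @ W)  # Rodrigues with sin=s, cos=c

AXES = np.eye(3)  # the x, y, z coordinate axes

def static_feature(bone):
    """Six rotations per rigid body: bone -> each axis, and each axis -> bone."""
    to_axes = [rotation_between(bone, a) for a in AXES]
    from_axes = [R.T for R in to_axes]        # inverse rotations
    return to_axes + from_axes
```

Each of the six matrices is a point in SO(3); concatenating their so(3) vector forms over all rigid bodies would yield the static feature vector.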

3.2.2. Dynamic Feature of Skeleton

Rigid-body motion is generally regarded as rotation and translation in 3D space. Mathematically, the rotations and translations of a rigid body can be represented by elements of SE(3). In this study, SE(3) is employed to describe rigid-body motions between different skeletons. Consider a rigid body as it appears in skeleton i and in skeleton j (two different skeletons of the same sequence).

Given a point of SE(3) corresponding to the rigid body, its rotation and translation transform the rigid body in skeleton j to the position and orientation of the rigid body in skeleton i, as shown in Figure 4(b).

Similarly, a second point of SE(3) describes the motion in the opposite direction, whose rotation and translation transform the rigid body in skeleton i to the position and orientation of the rigid body in skeleton j, as shown in Figure 4(c). This pair of points is used to represent the motion of a rigid body between skeletons i and j. The rigid-body motions between two skeletons can then be represented by the collection of these pairs over all rigid bodies, where M is the total number of rigid bodies in the skeleton.

In this study, our approach only considers the rigid-body motions between skeletons t and t-1 and those between skeletons t and 1. According to formula (24), the rigid-body motions between skeletons t and t-1 can be represented in this way, and similarly for the rigid-body motions between skeletons t and 1. The dynamic feature of skeleton t is then defined as the set of these rigid-body motions.

Skeleton t is represented by the combination of its static feature and its dynamic feature.

3.3. Skeleton Sequence Representation
3.3.1. Lie Group Representation of Skeleton Sequence

Using the proposed skeletal feature, a skeleton sequence (an action) is represented as the sequence of its per-frame skeleton representations, where T is the total number of frames in the sequence and each representation lies in the Lie group defined above.

3.3.2. Lie Algebra Representation of Skeleton Sequence

Since most classification methods (such as the SVM) cannot be directly applied to manifolds, the Lie group representation is mapped to its Lie algebra to overcome these difficulties. The Lie algebra of the product group is the direct sum of the Lie algebras of its SO(3) and SE(3) factors. A human action can thus be represented by the corresponding Lie algebra structure, where T is the total number of frames in the sequence.

M is the total number of rigid bodies in a skeleton. Given a rigid body in a skeleton, the Lie-algebra representation of its orientation is an 18-dimensional vector (six rotations, each contributing a 3-dimensional so(3) vector). The Lie-algebra representation of the static feature of a skeleton is therefore an 18M-dimensional vector. The Lie-algebra representation of the motion of a rigid body between skeletons t and t-1 is a 12-dimensional vector (two se(3) points, each contributing a 6-dimensional vector), so the rigid-body motions between skeletons t and t-1 form a 12M-dimensional vector; the same holds for the motions between skeletons t and 1. The Lie-algebra representation of the dynamic feature of skeleton t is thus a 24M-dimensional vector, and the representation of skeleton t itself is an (18M + 24M) = 42M-dimensional vector. Hence, a human action can be seen as the temporal evolution of a 42M-dimensional vector.

3.4. Key-Skeleton-Pattern Mining

In the previous subsections, a skeleton sequence is represented as a Lie algebra structure, where T is the total number of frames in the sequence. However, a skeleton sequence can include many noisy skeletons, which reduce the accuracy and efficiency of action recognition. In this subsection, the key-skeleton-patterns are used to eliminate noisy skeletons from a skeleton sequence in order to capture the most informative skeleton sequences.

3.4.1. Formal Definitions

To mine the key-skeleton-patterns, classic k-means is used to quantize all skeletons, represented by their Lie algebra vectors, into K symbols. Let the dictionary be a set containing K symbols, with a corresponding set of K centroids. A skeleton sequence can then be represented as a symbol sequence. Since different skeletons may be quantized to the same symbol, to minimize the effect of quantization errors, each skeleton in a sequence is represented by a probability that measures the distance between the skeleton and the centroid of its symbol, as given in equation (31); the probability is inversely proportional to this distance. A skeleton sequence can therefore also be represented by a probability sequence.
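The quantization step can be sketched as follows. Since the exact form of equation (31) is not reproduced above, the inverse-distance normalization used here is an illustrative stand-in for a probability that decreases with distance; in practice the centroids would come from k-means, and `quantize` is our own name.

```python
import numpy as np

def quantize(skeletons, centroids):
    """Assign each skeleton vector to its nearest centroid (symbol), attaching a
    distance-based probability: inverse distance, normalized over all symbols
    (an illustrative choice; the paper's Eq. (31) may differ in form)."""
    symbols, probs = [], []
    for x in skeletons:
        d = np.linalg.norm(centroids - x, axis=1)  # distance to every centroid
        inv = 1.0 / (d + 1e-9)                     # larger when closer
        k = int(np.argmin(d))                      # nearest symbol
        symbols.append(k)
        probs.append(inv[k] / inv.sum())           # normalized confidence
    return symbols, probs
```

A skeleton exactly at a centroid gets a probability near 1; a skeleton halfway between two centroids gets a probability near 0.5, reflecting the quantization ambiguity.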

Definitions. Some terms used in this paper are defined as follows (important notations are listed in Table 2).

Definition 1 (pattern). A pattern is a sequence of m symbols chosen from the dictionary.

Definition 2 (mining sequence). A mining sequence is the sequence used to mine the key-skeleton-patterns. It comprises a probability sequence, which represents a skeleton sequence; the symbol sequence that corresponds to it; and the skeleton sequence represented by the Lie algebraic structure.

Definition 3 (projected dataset). Given a pattern and a mining sequence dataset of an action class, the projected dataset is the set of mining sequences in the dataset projected with respect to that pattern.

Definition 4 (support). For a pattern and a symbol sequence (an element of the mining sequence dataset), let an indicator variable take value 1 if the pattern is a subsequence of the symbol sequence, and 0 otherwise. For any pattern, its support in the dataset is the sum of this indicator over all symbol sequences in the dataset.

Definition 5 (expected support). Given a pattern and a symbol sequence (an element of the dataset), consider the positions that the pattern occupies in the symbol sequence and the product of the probabilities at those positions. For any pattern, its expected support in the dataset is the sum of these probability products over all symbol sequences.

Definition 6 (key-skeleton-pattern). Given a pattern and a mining sequence dataset of an action class, if the support of the pattern is larger than a threshold and its expected support is larger than a second threshold, then the pattern is called a key-skeleton-pattern of that action class. A key-skeleton-pattern of length l is called an l-pattern.
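Definitions 4 and 5 can be made concrete with a small sketch. Here the support is normalized by the number of sequences and the expected support uses the leftmost subsequence match; both are our assumptions where the paper's formulas are elided, and all function names are ours.

```python
def is_subsequence(pattern, seq):
    """True if pattern occurs in seq as a (not necessarily contiguous) subsequence."""
    it = iter(seq)
    return all(sym in it for sym in pattern)  # 'in' consumes the iterator

def support(pattern, symbol_seqs):
    """Fraction of symbol sequences that contain the pattern (Definition 4,
    normalized by dataset size)."""
    hits = sum(is_subsequence(pattern, s) for s in symbol_seqs)
    return hits / len(symbol_seqs)

def expected_support(pattern, symbol_seqs, prob_seqs):
    """Average probability product at the leftmost match of the pattern
    (Definition 5); sequences without a match contribute 0."""
    total = 0.0
    for syms, probs in zip(symbol_seqs, prob_seqs):
        prod, i = 1.0, 0
        for sym, p in zip(syms, probs):
            if i < len(pattern) and sym == pattern[i]:
                prod *= p   # multiply the probability at each matched position
                i += 1
        total += prod if i == len(pattern) else 0.0
    return total / len(symbol_seqs)
```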

3.4.2. MinP-PrefixSpan Algorithm

In this subsection, a new pattern-growth algorithm, called MinP-PrefixSpan, is proposed to mine the key-skeleton-patterns of an action class by searching over the enormous space of the symbol sequences and probability sequences of the action class. The algorithm is shown in Algorithm 1. In Lines 2-8, the dataset is employed to construct a new projected dataset. In Lines 9-11, if the candidate is a key-skeleton-pattern, the pattern is appended to the result set and a symbol table is constructed from it. In Line 12, MinP-PrefixSpan is recursively called to grow the key-skeleton-pattern until all key-skeleton-patterns are found.

Input:projected dataset , symbol table , key-skeleton-pattern dataset
(1)  for each symbol do
(2)   
(3)   for each do
(4)    
(5)    if then
(6)     Append to
(7)    end if
(8)   end for
(9)   if and then
(10)   
(11)   Append to
(12)   MinP-PrefixSpan(,)
(13)  end if
(14)  Free and
(15) end for

In Line 10, the trim algorithm is used to improve the efficiency of the MinP-PrefixSpan algorithm by eliminating nonkey-skeleton-patterns. Algorithm 2 shows the implementation details of the trim algorithm. The trim algorithm mainly consists of the following two parts:

Input: symbol table , projected dataset
Output: symbol table
(1)
(2) for each symbol do
(3)  if and then
(4)   
(5)  end if
(6) end for

Given a mining sequence dataset of an action class and a pattern (with its projected dataset, according to Definition 3), two rules are proposed to trim non-key-skeleton-patterns: (1) if the support falls below its threshold, the pattern is a non-key-skeleton-pattern; (2) if the expected support falls below its threshold, the pattern is a non-key-skeleton-pattern.

Referring to the pattern-growth method of PrefixSpan, one symbol at a time is used to grow a key-skeleton-pattern, and the support and expected support of the grown pattern are checked. A symbol table is used to store the usable symbols in order to reduce the number of new skeleton patterns in each growth step. An important property holds between symbol tables.

Property 7 (symbol table). If one key-skeleton-pattern grows from another, then the symbol table of the grown pattern is contained in the symbol table of the original pattern.

Proof. Suppose a key-skeleton-pattern grows from another key-skeleton-pattern, and consider a symbol in the symbol table of the grown pattern. By Definition 4, the support of the grown pattern extended by this symbol is no larger than the support of the original pattern extended by the same symbol; by Definition 5, the same holds for the expected support. Hence the symbol also belongs to the symbol table of the original pattern, which implies the containment.
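The overall growth loop can be sketched as follows. This compresses Algorithms 1 and 2 into a single recursive function, omits the projected-dataset and symbol-table bookkeeping that makes MinP-PrefixSpan efficient, and uses simplified support measures defined inline, so it is a conceptual sketch rather than the paper's implementation.

```python
def mine_patterns(symbol_seqs, prob_seqs, delta, eps, alphabet):
    """Grow patterns one symbol at a time, keeping only those whose support and
    expected support exceed the thresholds delta and eps (pattern growth in the
    spirit of MinP-PrefixSpan; all names here are ours)."""

    def matches(pattern, syms, probs):
        # leftmost subsequence match; returns the probability product, or None
        prod, i = 1.0, 0
        for sym, p in zip(syms, probs):
            if i < len(pattern) and sym == pattern[i]:
                prod *= p
                i += 1
        return prod if i == len(pattern) else None

    found = []

    def grow(prefix):
        for sym in alphabet:
            cand = prefix + (sym,)
            prods = [matches(cand, s, p) for s, p in zip(symbol_seqs, prob_seqs)]
            sup = sum(pr is not None for pr in prods) / len(symbol_seqs)
            esup = sum(pr or 0.0 for pr in prods) / len(symbol_seqs)
            if sup > delta and esup > eps:   # both thresholds must hold
                found.append(cand)
                grow(cand)                   # recurse: pattern growth

    grow(())
    return found
```

Because support can only shrink as a pattern grows, any branch that fails a threshold is pruned immediately, which mirrors the trim rules above.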

3.5. Discovering the Most Informative Skeleton Sequence

The task of Algorithm 3 is to discover the most informative skeleton sequences for all actions. A mining sequence dataset of all actions is given; one dataset is used to store the key-skeleton-patterns of all action classes, and another is used to store the most informative skeleton sequences of all actions. In Lines 2-11, the key-skeleton-patterns of each action class are mined from the training dataset and appended to the pattern dataset. In Lines 12-22, the key-skeleton-patterns are employed to discover the most informative skeleton sequence of each action, and all most informative skeleton sequences are stored in the output dataset (refer to Figure 5).

Input: key-skeleton-pattern dataset , mining sequence dataset of all actions
Output: The dataset of the most informative skeleton sequences
(1)  Obtaining training dataset from
(2)  ;
(3)  for the dataset of each action class U do
(4)   T is the table that includes all symbols in
(5)   for each symbol do
(6)    if and then
(7)     
(8)     append to K
(9)     MinP-PrefixSpan()
(10)   end if
(11)  end for
(12) end for
(13) for each element do
(14)  for each key-skeleton-pattern   do
(15)   if m is a subsequence of d.L then
(16)    
(17)    
(18)    fordo
(19)     
(20)    end for
(21)   end if
(22)  end for
(23)  for z=1 to Len(d.L) do
(24)   if patternpos[z]==1 then
(25)    append d.LA[z] to s
(26)   end if
(27)  end for
(28)  Append s to
(29) end for
(30) return

Dynamic time warping (DTW) [33] has excellent performance in searching for an optimal alignment between time sequences. Therefore, for each action class, our model uses the action standardization algorithm proposed by the authors of [5] to compute a nominal action and employs DTW to warp all training or testing actions onto this nominal action. SVMs are extensively used in computer vision and achieve excellent performance in image and video classification. To achieve better classification results, a linear SVM is used to classify the most informative skeleton sequences.
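For reference, the DTW distance used for warping actions onto a nominal action can be sketched in a few lines. This is the textbook dynamic program with Euclidean frame costs, not necessarily the specific variant of [33].

```python
import numpy as np

def dtw(a, b):
    """Dynamic-time-warping distance between two feature sequences a and b,
    where each element is a per-frame feature vector."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)  # cumulative-cost table
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            # extend the cheaper of the three admissible warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Two sequences that differ only by repeated (time-stretched) frames have distance zero, which is exactly the invariance that motivates warping actions of different speeds onto one nominal action.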

3.6. Datasets

In this study, three standard 3D human action datasets are employed to study the effectiveness of the proposed method.

The MSRAction3D dataset [34] was captured using a depth camera similar to the Kinect device. This dataset consists of 20 actions performed by 10 subjects, with each action repeated two or three times. In total, there are 557 action sequences. The dataset provides the 3D locations of 20 joints. The horizontal and vertical locations of each skeleton joint are stored in screen coordinates, and the joint depth is stored in global coordinates. Human actions in this dataset cover various types of motions related to the arms, legs, torso, and their combinations. Experiments on this dataset are challenging, but the dataset is widely applied to test the accuracy and robustness of recognition methods for various actions.

The UTKinect-Action dataset [19] was captured using a stationary Kinect sensor. It consists of 10 human actions from daily life: walking, sitting down, standing up, picking up, carrying, throwing, pushing, pulling, waving, and clapping hands. Each action is performed twice or thrice by 10 different subjects (nine males and one female). In total, there are 199 action sequences. This dataset is very challenging. First, in some action sequences, parts of the human body are invisible because they are out of the field of view. Second, subjects performed the same action using different limbs, such as waving the left hand versus waving the right hand. Third, it is very difficult to capture the action sequences in a view-invariant manner.

The G3D dataset [23] consists of 663 sequences of 20 gaming actions captured by Kinect. Each actor performed each gaming action more than twice. Although the dataset provides synchronized video, depth, and skeleton data, only the skeleton data are used in our experiments. The dataset is challenging for the following two reasons: (1) if body parts are occluded, the Kinect device returns inferred joints, which may reduce the accuracy of action recognition; (2) if two different actions have very small interclass variations, the two actions may easily be confused with each other during recognition.

4. Experimental Results

The skeleton preprocessing is as follows. A human action is composed of a continuous evolution of a series of skeletons. To make each skeleton view-invariant, all 3D joint coordinates in the skeleton are transformed to a coordinate system that places the hip center at the origin. The entire skeleton is then rotated until the global x-axis is aligned with the ground-plane projection of the vector from the left hip to the right hip (refer to Figure 2(b)).
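This preprocessing step can be sketched as follows. The joint indices are illustrative, and the rotation is taken about the vertical (y) axis under the assumption that y is the up direction; the function name is ours.

```python
import numpy as np

def normalize_skeleton(joints, hip_center, left_hip, right_hip):
    """Place the hip center at the origin and rotate about the vertical (y)
    axis so the ground-plane projection of the left-hip -> right-hip vector
    aligns with the global x-axis. `joints` is an (N, 3) array."""
    J = joints - joints[hip_center]            # translate hip center to origin
    v = J[right_hip] - J[left_hip]
    angle = np.arctan2(v[2], v[0])             # heading in the x-z ground plane
    c, s = np.cos(angle), np.sin(angle)
    Ry = np.array([[c, 0., s],                 # rotation about y undoing `angle`
                   [0., 1., 0.],
                   [-s, 0., c]])
    return J @ Ry.T                            # rotate every joint
```

After normalization the hip-to-hip vector has no z-component and points in the positive x direction, making the representation invariant to the camera's horizontal viewing angle.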

4.1. Experiments on the MSRAction3D Dataset

Following the experimental protocol of [4], the 20 actions in the MSRAction3D dataset are divided into three subsets AS1, AS2, and AS3, each including eight actions. AS1 and AS2 include actions with similar movements, while AS3 groups together more complex actions. Half of the subjects are chosen for training, and the remaining subjects for testing. The experiment is run on ten different combinations of training and testing sets, and the mean performance is reported. Figure 6 shows that our approach outperforms various other representations. Our approach achieves a mean accuracy of 94.58% on the MSRAction3D dataset, outperforming other action recognition approaches, including Bag of 3D Points [4], Eigenjoints [29], and Lie group [5], which achieved accuracies of 74.70%, 83.30%, and 91.88%, respectively. Our approach performs better than the others both in distinguishing similar actions and in recognizing complex actions. This is mainly because the informative skeleton sequences, represented by the Lie group, are used to train the SVM classifiers.
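The cross-subject protocol described above (half of the subjects for training, the rest for testing, averaged over ten train/test combinations) can be sketched as follows. This is an illustrative harness, not the authors' code; `train_eval_fn` stands in for the full feature-extraction and SVM pipeline.

```python
import itertools
import numpy as np

def cross_subject_splits(subject_ids, n_train):
    """Enumerate all train/test subject splits with n_train training subjects."""
    subject_ids = sorted(set(subject_ids))
    for train in itertools.combinations(subject_ids, n_train):
        test = [s for s in subject_ids if s not in train]
        yield list(train), test

def mean_accuracy(subjects, train_eval_fn, n_splits=10, seed=0):
    """Average accuracy over n_splits randomly chosen half-subject splits.

    subjects: per-sequence subject id.
    train_eval_fn(train_idx, test_idx) -> accuracy; a placeholder for
    training the classifier on train_idx and evaluating on test_idx.
    """
    rng = np.random.default_rng(seed)
    all_ids = sorted(set(subjects))
    splits = list(cross_subject_splits(all_ids, len(all_ids) // 2))
    chosen = rng.choice(len(splits), size=min(n_splits, len(splits)), replace=False)
    accs = []
    for k in chosen:
        train_subj, test_subj = splits[k]
        train_idx = [i for i, s in enumerate(subjects) if s in train_subj]
        test_idx = [i for i, s in enumerate(subjects) if s in test_subj]
        accs.append(train_eval_fn(train_idx, test_idx))
    return float(np.mean(accs))
```

Reporting the mean over several subject splits, rather than a single split, reduces the variance introduced by which particular subjects land in the test set.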

Following the experimental protocol of [16], the dataset containing all actions is tested. This experimental setting is more challenging than that of [4]. Our approach achieves an accuracy of 97.40%, outperforming other related action recognition approaches, including Grassmann manifold [16], graph-based model [20], and Lie group [5], which achieved accuracies of 91.21%, 92.20%, and 92.46%, respectively, as shown in Table 3.

Figure 7 shows the classification confusion matrix on the whole MSRAction3D dataset. Most actions in the dataset are correctly recognized by our approach, but classification errors occur when two actions are extremely similar, such as draw tick and draw .

Matlab is used to run the experiments on a machine with a 3.60 GHz Intel Core i7-4790 CPU. The average testing time of one action sequence in the dataset is only 35.1 ms, which is lower than that of Lie group (72.5 ms). The reason is that the skeleton feature dimension of our approach (798 dimensions) is lower than that of Lie group (2052 dimensions). However, since the authors of Grassmann manifold and graph-based model have not released the source code of their approaches, their average testing times cannot be obtained.

4.2. Experiments on the UTKinect-Action Dataset

The recognition rate of our approach on the UTKinect-Action dataset is 98.87%. Our method outperforms the Lie group [5], Grassmann manifold [16], Eigenjoints [29], and learning feature combination [30], which achieved recognition rates of 97.08%, 97.91%, 97.10%, and 98.00%, respectively, as shown in Table 4.

The average testing time of one action sequence in the dataset is 33.6 ms, which is lower than that of Lie group (69.2 ms) but higher than that of learning feature combination (13.7 ms). The reason is that the skeleton feature dimension of our approach (798 dimensions) is lower than that of Lie group (2052 dimensions) but higher than that of learning feature combination (256 dimensions). Unfortunately, the average testing times of Grassmann manifold and Eigenjoints cannot be obtained without the source code of those approaches.

4.3. Experiments on the G3D-Gaming Dataset

The cross-subject test setting, in which half of the subjects are used for training and the remaining subjects for testing, is used to perform recognition on this dataset. Table 5 compares our approach with other approaches on the dataset (GB-RBM+HMM [21] and LieNet [17] use deep-learning methods to recognize human actions). Our approach achieves a higher recognition rate.

The average testing time of one action sequence in the dataset is only 34.8 ms, which is lower than that of Lie group (71.2 ms) and (58.9 ms). The reason is that the skeleton feature dimension of our approach (798 dimensions) is lower than those of Lie group (2052 dimensions) and (1026 dimensions). However, since the authors of tLDS have not released the source code of their approach, its average testing time cannot be obtained. Moreover, because deep-learning-based approaches usually use GPUs to accelerate their models while non-deep-learning-based approaches usually run on CPUs, a fair comparison between the two classes of approaches is difficult (our approach is non-deep-learning-based, whereas LieNet and GB-RBM+HMM are deep-learning-based).

5. Conclusion and Future Work

This paper proposes a new skeleton-based action representation, which consists of static and dynamic features. First, the orientation of a rigid body is regarded as six rotation matrices, and each rotation matrix is represented as a point in SO(3). The rigid-body orientations in a skeleton are used to construct the static feature in order to avoid dealing with skeletal scale variations. Second, the motions of rigid bodies, represented in SE(3), are used to construct the dynamic features in order to capture the temporal information of the skeleton. Finally, based on the proposed representation, the key-skeleton-patterns are employed to discover the most informative skeleton sequences. The experimental results show that our approach achieves better performance than other state-of-the-art skeleton-based action recognition approaches. Future research will combine the Lie group with a linear dynamical system to model human actions as a tensor time series.

Data Availability

In this article, we performed our experiments on the following three public datasets. (1) The MSRAction3D dataset is an action dataset of depth sequences captured by a depth camera; it can be found at http://research.microsoft.com/en-us/um/people/zliu/actionrecorsrc/. (2) The UTKinect-Action3D dataset was collected as part of research on action recognition from depth sequences; it can be found at http://cvrc.ece.utexas.edu/KinectDatasets/HOJ3D.html. (3) The G3D dataset contains synchronized video, depth, and skeleton data; it can be found at http://dipersec.king.ac.uk/G3D/, or all three action recognition datasets can be found at https://github.com/liguang1980/Action-recognition-Datasets.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant no. 61571345, the Fundamental Research Funds for the Central Universities under Grant no. K5051203005, the National Natural Science Foundation of China under Grant no. 91538101, the National Natural Science Foundation of China under Grant no. 61850410523, Huawei Innovation Research Program under Grant no. 2017050310, the Fundamental Research Funds for Xidian University no. XJS18041, and the Natural Science Foundation of the Anhui Higher Education Institutions of China no. KJ2017A376.