Abstract

This paper addresses the problem of predicting human actions in depth videos. Due to the complex spatiotemporal structure of human actions, it is difficult to infer ongoing human actions before they are fully executed. To handle this challenging issue, we first propose two new depth-based features, called pairwise relative joint orientations (PRJOs) and depth patch motion maps (DPMMs), to represent the relative movements between each pair of joints and human-object interactions, respectively. The two proposed depth-based features are suitable for recognizing and predicting human actions in real time. We then propose a regression-based learning approach with a group sparsity inducing regularizer to learn an action predictor based on the combination of PRJOs and DPMMs for a sparse set of joints. Experimental results on benchmark datasets demonstrate that our proposed approach significantly outperforms existing methods for real-time human action recognition and prediction from depth data.

1. Introduction

Predicting ongoing human actions based on incomplete observations plays an important role in many real-world applications such as surveillance, clinical monitoring, and human-robot interaction. Despite significant research efforts in the past decade, it remains challenging to represent human actions for action prediction due to the complex articulated nature of human movements performed in a variety of scenarios. In addition, some actions may include human-object interactions in the environment, which further increases the difficulty of action representation.

Recently introduced cost-effective depth cameras largely ease the task of action representation thanks to the availability of the 3D joint locations of the human skeleton and of depth map data describing actions. It is now commonly accepted that knowing the 3D joint locations is helpful for describing the articulated nature of human actions. With the 3D locations of skeletal joints, skeletal action representation can be performed by characterizing their variations over time. Skeletal action representation has, in turn, generated interest in skeletal human action prediction.

Most existing skeletal action prediction methods focus on predicting human actions using the orientations of joint movements. However, these skeletal features model actions simply as the motion of individual joints, which limits their ability to capture complex spatiotemporal relations among joints. Moreover, 3D joint locations without local appearance are insufficient to fully model a human action, especially when the action involves interactions between the human and external objects. Although many appearance features extracted from depth map data have been proposed in recent years, these features cannot be computed in real time.

This paper presents a novel approach for real-time action prediction with a depth camera. The flowchart of the proposed approach is illustrated in Figure 1 (left panel). We first propose two new depth-based features, called pairwise relative joint orientations (PRJOs) and depth patch motion maps (DPMMs), extracted from skeletal and depth map data. The PRJOs and DPMMs represent, respectively, the relative movement between each pair of joints and the local depth appearance of interactions between the human and environmental objects over the duration of an action. These two features complement each other and are suitable for real-time prediction. We then associate the two features of each individual joint as a bundle and propose a sparse regression-based learning model that utilizes group sparsity to select the associated features of the active joints for each action class and uses them to learn a predictor for real-time action prediction.

Our main contributions include three aspects: (1) We propose a group sparse regression-based learning model as a new way to learn an action predictor using selected discriminative features for different action classes. (2) We propose a skeletal feature called pairwise relative joint orientations (PRJOs) to describe the relative movement between each pair of joints. Different from other existing skeletal features for real-time action prediction, the PRJOs can encode the complex spatiotemporal relations among joints. (3) We propose a depth appearance-based feature called depth patch motion maps (DPMMs) to characterize human-object interactions. The DPMMs are computed more efficiently than other common appearance features.

After a brief review of the related work in Section 2, the two depth-based features are described in Section 3. Section 4 presents the group sparse regression-based learning model and its learning method. Section 5 presents the experimental evaluations. The conclusions are provided in Section 6.

2. Related Work

We first review human action prediction methods based on RGB data. Then, we review existing feature representations extracted from depth videos.

2.1. Action Prediction in RGB Videos

Recent efforts on human action prediction have mainly focused on predicting actions based on RGB videos. Hoai et al. [1] introduced an online Conditional Random Field method for human intent prediction. Ryoo et al. [2] presented a dynamic Bag-of-Words (BoW) method for action prediction. In this method, the entire BoW sequence is divided into subsegments to find the structural similarity between them. Based on a Naive-Bayes-Nearest-Neighbor classifier, Yang et al. [3] proposed an action classification approach that achieves similar levels of accuracy after seeing only 15-20 frames of an action sequence as opposed to the full action observation; this method is in essence used to predict actions. Ryoo et al. [4] designed a method for early recognition of human actions from streaming videos. Wang et al. [5] developed a Markov-based method for early prediction of human actions, aimed at human-robot interaction. Cao et al. [6] predicted actions from unfinished videos based on a set of completely observed training video action samples. Kong et al. [7, 8] extended the SVM and built multiple temporal scale templates to predict actions. Xu et al. [9] proposed to mine discriminative patches to autocomplete partial videos for action prediction. Kitani et al. [10] predicted the destinations of pedestrians based on a semantic scene understanding method. Li et al. [11] performed action prediction by capturing the causal relationships between constituent actions and the predictable characteristics of actions. Walker et al. [12] introduced an unsupervised approach to predict how a scene may change over time. These methods predict human actions from RGB sequences. Although they have made significant advances in action prediction, they cannot capture rich spatiotemporal information of actions very well due to the difficulty of capturing highly articulated motions from RGB data.

2.2. Action Analysis in Depth Videos

Recently, action analysis with depth cameras has attracted significant attention from many researchers. In the literature, how to mine a powerful depth-based feature representation for action analysis is one of the most fundamental research topics [13-16]. Depth-based features can be classified into two major classes. The first class consists of skeletal features, which extract information from the 3D joint locations provided on each frame of the depth sequence. Skeletal features make it easier to represent an articulated motion as a set of movements of body parts according to the locations of joints. However, most existing skeletal features are designed for action classification. Although Reily et al. [17] proposed skeletal features for action prediction, their features cannot model the complex structure among joints in motion. The reader is referred to [18] for a systematic review of action analysis methods based on skeletal representations. The second class consists of depth appearance features, which are extracted directly from depth map data. Many depth appearance features have been proposed in recent years [19-23]. These features mainly rely on off-line computation. Different from the previous depth-based features, in this paper, we propose the pairwise relative joint orientations (PRJOs) and depth patch motion maps (DPMMs) to characterize the spatiotemporal relations among joints and the depth appearance of human-object interactions for real-time action prediction. Moreover, we associate the PRJOs and DPMMs into different feature groups according to different joints and learn the group weights based on group sparse regularization, which was not considered in previous work. The resulting group sparse weight matrices help to select the discriminative feature structures for real-time action prediction.

3. Depth-Based Feature Construction

In this section, a detailed description of the two proposed depth-based features is given: the PRJOs and the DPMMs. These features characterize the spatiotemporal relations among joints and the depth appearance of human-object interactions, respectively.

3.1. Pairwise Relative Joint Orientations

For a human action in a depth video, suppose that the joint locations of the human body are detected by the skeleton detector provided by Shotton et al. [31]. Let $p_i^t \in \mathbb{R}^3$ be the 3D coordinates of the $i$-th joint at frame $t$. The human body represented by 15 skeletal joints is shown in Figure 1 (right panel). The coordinates are normalized so that the motion is invariant to the absolute body position, the body size, and the initial body orientation. The trajectory of each joint in 3D space is spatially decomposed into three 2D joint trajectories by projecting the original 3D joint trajectory onto the orthogonal Cartesian planes. Inspired by the observation that, for human skeletal actions, the relative movements between skeletal joints provide a more meaningful description than their absolute movements (clapping, for instance, is more intuitively described by the relative movement between the two hand joints), we describe the 2D trajectory of one joint relative to another, instead of the 2D trajectory of each individual joint, on each plane to capture the spatiotemporal variations between each joint pair. This relative joint trajectory is represented using a histogram of the oriented angles between temporally adjacent direction vectors. Let $d_{ij}^{t}$ be the direction vector of the $i$-th joint relative to the $j$-th joint at frame $t$ in an orthogonal Cartesian plane; the oriented angle between temporally adjacent direction vectors is given by

$$\theta_{ij}^{t} = \arccos\left(\frac{d_{ij}^{t} \cdot d_{ij}^{t+1}}{\|d_{ij}^{t}\|\,\|d_{ij}^{t+1}\|}\right),$$

where $t = 1, \ldots, T-1$ (Figure 2 (left panel)). Then, $h_i$ is given as the histogram of the oriented angles $\theta_{ij}^{t}$, calculated to represent the spatiotemporal relations between joint $i$ and the other joints. Moreover, in order to encode long-term temporal relationships, $h_i$ is processed using the Fourier temporal pyramid (FTP) proposed by Wang et al. [5]. As a result, we obtain the pairwise relative joint orientations (PRJOs) feature $f_i^{\mathrm{PRJO}}$ of joint $i$.
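
To make the construction concrete, the following minimal NumPy sketch computes the oriented-angle histogram for one joint pair on one projection plane; the bin count and the helper name are illustrative assumptions, and the Fourier temporal pyramid step is omitted.

```python
import numpy as np

def prjo_histogram(traj_i, traj_j, num_bins=12):
    """Oriented-angle histogram for joint i relative to joint j on one 2D plane.

    traj_i, traj_j: arrays of shape (T, 2) holding the projected 2D joint
    trajectories; num_bins is an illustrative choice.
    """
    # Direction vectors of joint i relative to joint j at every frame.
    d = traj_i - traj_j                                   # shape (T, 2)
    # Angles between temporally adjacent direction vectors.
    dot = np.sum(d[:-1] * d[1:], axis=1)
    norm = np.linalg.norm(d[:-1], axis=1) * np.linalg.norm(d[1:], axis=1)
    angles = np.arccos(np.clip(dot / np.maximum(norm, 1e-8), -1.0, 1.0))
    # Histogram of the oriented angles over [0, pi].
    hist, _ = np.histogram(angles, bins=num_bins, range=(0.0, np.pi))
    return hist / max(hist.sum(), 1)                      # normalized histogram
```

Repeating this over the three projection planes and over all joint pairs, and passing the resulting histograms through the FTP, yields the PRJOs feature of a joint.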

3.2. Depth Patch Motion Maps

While the PRJOs feature can characterize the relative movement between joints, it cannot accurately capture the interactions between humans and objects. Therefore, another depth-based feature is designed to describe the depth appearance of human-object interactions. The depth motion maps (DMMs) proposed by Yang et al. [20] can effectively encode the shape and motion cues of a depth sequence. In this paper, building on DMMs, we propose the depth patch motion maps (DPMMs) to describe the temporal dynamics of the depth appearances of human-object interactions according to the 3D locations of joints.

First, for the sake of computational simplicity, we project the depth frames onto three orthogonal Cartesian planes, as in Yang et al. [20]. More specifically, the three 2D projected maps correspond to the front, side, and top views, denoted by $D_v$, where $v \in \{f, s, t\}$. Different from Yang et al. [20], each projected map is divided into local patches according to the locations of the joints on each frame (Figure 2 (right panel)), and the motion energy is computed without thresholding.

Then, for a depth sequence with $N$ frames, the depth patch motion map (DPMM) of joint $j$ under projection view $v$ is given by stacking the motion energy across the entire depth sequence as follows:

$$\mathrm{DPMM}_{j,v} = \sum_{t=1}^{N-1} \left| P_{j,v}^{t+1} - P_{j,v}^{t} \right|,$$

where $P_{j,v}^{t}$ is the local patch around joint $j$ in the projected map $D_v$ at frame $t$ and $v \in \{f, s, t\}$. $\mathrm{DPMM}_{j,v}$ represents the temporal dynamics of the depth appearance around joint $j$. Since we do not calculate HOG descriptors of the DPMMs as done in [20], and the image resizing is applied to the DPMMs rather than to each projected map as done in [20], the computational complexity of the feature extraction is greatly reduced.
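
As an illustration, the sketch below accumulates the unthresholded motion energy of a local patch around one joint in one projection view; the patch size, the joint-to-pixel mapping, and the omission of the final resizing step are assumptions made for the example.

```python
import numpy as np

def dpmm(frames, joint_px, patch=32):
    """Depth patch motion map of one joint under one projection view.

    frames:   projected depth maps, array of shape (N, H, W)
    joint_px: per-frame (row, col) pixel location of the joint, shape (N, 2)
    patch:    half-size of the square patch around the joint (an assumption)
    """
    frames = np.asarray(frames, dtype=np.float32)
    n, h, w = frames.shape
    energy = np.zeros((2 * patch, 2 * patch), dtype=np.float32)
    for t in range(n - 1):
        # Clamp the patch centre so the crop stays inside the frame.
        r = int(np.clip(joint_px[t][0], patch, h - patch))
        c = int(np.clip(joint_px[t][1], patch, w - patch))
        cur = frames[t, r - patch:r + patch, c - patch:c + patch]
        nxt = frames[t + 1, r - patch:r + patch, c - patch:c + patch]
        # Accumulate unthresholded motion energy over the sequence.
        energy += np.abs(nxt - cur)
    return energy
```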

3.3. Combination of Depth-Based Features

For each joint $j$, we concatenate the PRJOs feature $f_j^{\mathrm{PRJO}}$ and the DPMMs feature $f_j^{\mathrm{DPMM}}$ to form the overall depth-based feature vector $x_j = [f_j^{\mathrm{PRJO}}; f_j^{\mathrm{DPMM}}]$ of joint $j$.
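
A minimal sketch of this per-joint bundling, assuming the PRJO and DPMM routines sketched above; the flattening and ordering are illustrative.

```python
import numpy as np

def joint_bundle(prjo_feat, dpmm_maps):
    """Concatenate the PRJO feature and the flattened DPMMs of one joint."""
    return np.concatenate([np.ravel(prjo_feat)] + [np.ravel(m) for m in dpmm_maps])

def sequence_feature(per_joint_bundles):
    """Stack the per-joint bundles; their order defines the feature groups
    used by the group sparse regression model in Section 4."""
    return np.concatenate(per_joint_bundles)
```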

4. Group Sparse Regression-Based Learning Model

In this section, we propose a group sparse regression-based learning model for human action prediction with a depth camera.

4.1. Model Formulation

To train a human action predictor, we divide each complete training sequence into $K$ segments, as in [7, 8]. Let $X_k = [x_1^k, \ldots, x_n^k] \in \mathbb{R}^{d \times n}$ be the feature matrix based on the concatenation of PRJOs and DPMMs for the $n$ training samples of the $k$-th segment, where a partition operator splits each column $x^k$ into $J$ parts according to the number of joints in each frame. Labeling a subsequence with the label of its full sequence could be confusing. To solve this problem, we learn a label for each subsequence and define $y_i^k$ as the label of the $k$-th subsequence of the $i$-th full sequence. The corresponding labels of $X_k$ for $C$ action classes are $Y_k = [y_1^k, \ldots, y_n^k]^{\top}$ with $y_i^k \in \{0, 1\}^{C}$ and $\mathbf{1}^{\top} y_i^k = 1$. To obtain the weight matrix set $\{W_k\}_{k=1}^{K}$ as the action predictor by mining the discriminative features of the input samples of each of the $K$ segments, we propose a group sparse regression-based learning model as follows:

$$\min_{W_k} \left\| X_k^{\top} W_k - Y_k \right\|_F^2 + \lambda \sum_{j=1}^{J} \left\| W_k^{j} \right\|_F, \quad k = 1, \ldots, K,$$

where $W_k^{j}$ denotes the rows of $W_k$ belonging to the feature group of joint $j$ and $\lambda > 0$ is a regularization parameter controlling the group sparsity.

4.2. Model Optimization

To optimize the predictor $W_k$, setting the derivative of the objective function with respect to $W_k$ to zero, we can obtain

$$W_k = \left( X_k X_k^{\top} + \lambda D_k \right)^{-1} X_k Y_k,$$

in which $D_k$ is a block diagonal matrix whose $j$-th diagonal block is $\frac{1}{2\|W_k^{j}\|_F} I_j$, $I_j$ is an identity matrix of the same size as the $j$-th feature group, and $W_k^{j}$ is the $j$-th part of $W_k$ obtained by the partition operator, with $j = 1, \ldots, J$. Since $D_k$ depends on $W_k$, we give an iterative algorithm described in Algorithm 1.

Input: $X_k$, $Y_k$, $\lambda$.
Output: $W_k$.
 1: Let $s = 1$. Initialize $W_k^{(1)}$.
 2: while not converged do
 3:  Calculate the block diagonal matrix $D_k^{(s)}$, where the $j$-th diagonal block of $D_k^{(s)}$ is $\frac{1}{2\|W_k^{j,(s)}\|_F} I_j$.
 4:  Update $W_k^{(s+1)} = \left( X_k X_k^{\top} + \lambda D_k^{(s)} \right)^{-1} X_k Y_k$.
 5:  $s = s + 1$.
 6: end while
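
For illustration, the following minimal NumPy sketch implements the reweighting scheme of Algorithm 1 for a single segment; the variable names, the convergence test, and the default regularization value are assumptions made for the example rather than part of the original method description.

```python
import numpy as np

def group_sparse_regression(X, Y, group_sizes, lam=1.0, max_iter=50, tol=1e-5):
    """Solve min_W ||X^T W - Y||_F^2 + lam * sum_j ||W_j||_F for one segment.

    X: (d, n) feature matrix, Y: (n, C) label matrix,
    group_sizes: list of per-joint feature dimensions (the groups).
    """
    d = X.shape[0]
    W = np.zeros((d, Y.shape[1]))
    # Index ranges of each joint's feature group inside the d dimensions.
    bounds = np.cumsum([0] + list(group_sizes))
    XXt, XY = X @ X.T, X @ Y
    for _ in range(max_iter):
        # Block diagonal reweighting matrix D (stored as its diagonal).
        diag = np.empty(d)
        for a, b in zip(bounds[:-1], bounds[1:]):
            diag[a:b] = 1.0 / (2.0 * max(np.linalg.norm(W[a:b]), 1e-8))
        W_new = np.linalg.solve(XXt + lam * np.diag(diag), XY)
        if np.linalg.norm(W_new - W) < tol:
            W = W_new
            break
        W = W_new
    return W
```

Running the solver once per segment yields the weight matrix set; joint groups whose rows shrink toward zero correspond to inactive joints for the given action classes.
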
4.3. Activity Prediction

Given an ongoing action sequence, we first extract the depth-based features based on the PRJOs and DPMMs. Then, a linear SVM classifier is applied to the final feature representation obtained with the learned weight matrices to make the prediction decision.
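
As a rough sketch of this step, the snippet below projects the depth-based feature of an ongoing sequence with a learned weight matrix and feeds the result to a linear SVM; scikit-learn's LinearSVC stands in for the LIBSVM classifier used in the experiments, and the projection is an illustrative reading of how the learned representation may be formed.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_predictor(features, labels, W):
    """features: (n, d) depth-based features of training subsequences,
    labels: (n,) action class indices, W: (d, C) learned weight matrix."""
    clf = LinearSVC(C=1.0)
    clf.fit(features @ W, labels)   # project, then train the linear SVM
    return clf

def predict_action(clf, feature, W):
    """Predict the action class of an ongoing (partially observed) sequence."""
    return clf.predict(feature.reshape(1, -1) @ W)[0]
```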

5. Experiments

In this section, we evaluate our approach on three public benchmarks. Throughout our experiments, we apply the LIBSVM software provided by Chang and Lin [32] with our final feature representation to train the linear SVM classifier.

5.1. Experimental Setting

Owing to its intraclass variations and its choice of action classes, the MSR-Daily Activity dataset [5] is one of the most challenging benchmarks for human action recognition. This dataset contains 16 types of actions: drink, eat, read book, call cell phone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play electronic game, lie down on sofa, walk, play a guitar, stand up, and sit down. A skeleton has 20 joint positions. The total number of action samples is 320. Most of the actions involve human-object interactions. The UTKinect-Action dataset [27] consists of depth sequences captured using a single stationary Kinect. The 3D locations of 20 joints are provided with the dataset. This dataset contains 10 action types: walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands, and clap hands. There are 10 subjects; each subject performs each action twice. The SYSU 3D HOI dataset [33] is a newer challenging action recognition dataset. This dataset contains 12 action types: drinking, pouring, calling phone, playing phone, wearing backpacks, packing backpacks, sitting chair, moving chair, taking out wallet, taking from wallet, mopping, and sweeping. The 3D locations of 20 joints are associated with each frame of the human action sequence.

For the MSR-Daily Activity dataset, UTKinect-Action dataset, and SYSU 3D HOI dataset, we first investigate the performance of our approach for recognizing human actions in real time using complete observations. We follow the same experimental setting as in other related works. For the three datasets, we use half of the subjects for training and the other half for testing. Then, we evaluate our approach for real-time action prediction on the three datasets.

5.2. Experimental Results

As shown in Table 1, on the MSR-Daily Activity, UTKinect-Action, and SYSU 3D HOI datasets, the proposed approach achieves high accuracies that are much better than the reported results of other state-of-the-art real-time methods. It is also clear that, using only the skeletal feature, our approach still performs better than the other methods, since the PRJOs feature captures the spatiotemporal relations among joints and the group sparse learning model mines discriminative features according to the sparse joint set. Although deep learning models have achieved great progress in action recognition, they do not model the complex spatial structure among skeletal joints very well. Moreover, the experimental results also show the benefit of combining our skeletal feature with the depth appearance feature.

Figure 3 shows the confusion matrices for the MSR-Daily Activity dataset (left panel), the UTKinect-Action dataset (middle panel), and the SYSU 3D HOI dataset (right panel). Our approach works very well overall. Confusions occur when two actions are highly similar to each other, such as “drinking” and “calling phone” in the SYSU 3D HOI dataset (right panel), or when similar actions involve only slight movements, such as “sit still” and “play electronic game” in the MSR-Daily Activity dataset (left panel).

Figure 4 shows the accuracy rates for early prediction of human actions for our proposed method and the BIPOD representation method on the MSR-Daily Activity dataset (left panel), the UTKinect-Action dataset (middle panel), and the SYSU 3D HOI dataset (right panel). From Figure 4, it is clear that our proposed method performs well in early action prediction. This is because our regression model makes use of the segments that contain partial action executions to obtain a reliable predictor.

6. Conclusions

This paper presents a novel sparse regression learning approach for real-time depth-based action prediction. We first introduce the pairwise relative joint orientations (PRJOs) and depth patch motion maps (DPMMs) to construct the associated depth-based feature describing each individual joint. Then, a group sparse regression-based learning model is proposed to learn an action predictor by mining a sparse combination of the associated depth-based features that discriminatively represents all the available human action classes. Finally, an SVM classifier is trained to make the prediction decision based on the learned feature representation. State-of-the-art results are achieved in different experiments, which shows the effectiveness of the proposed approach.

Data Availability

Previously reported MSR-Daily Activity dataset and UTKinect-Action dataset were used to support this study and are available at DOI: 10.1109/TPAMI.2013.198 and DOI: 10.1109/CVPRW.2012.6239233, respectively. These prior studies and datasets are cited at relevant places within the text as references [5, 27].

Conflicts of Interest

There are no conflicts of interest.

Acknowledgments

This work is supported by the Foundation of the Hebei Department of Human Resources and Social Security (no. C201810, Hebei Province Funding Program for Returned Overseas Scholars), the National Science Foundation of China (no. 61602148), and the Foundation of the Hebei Education Department (no. QN2018018).