Abstract

We propose a system that can recognize daily human activities with a Kinect-style depth camera. Our system utilizes a set of view-invariant features and the hidden state conditional random field (HCRF) model to recognize human activities from the 3D body pose stream provided by the MS Kinect API or OpenNI. Many high-level daily activities can be regarded as having a hierarchical structure in which multiple subactivities are performed sequentially or iteratively. In order to model these high-level daily activities effectively, we utilize a multiclass HCRF model, which is a kind of probabilistic graphical model. In addition, to obtain view-invariant but more informative features, we extract joint angles from the subject's skeleton model and then perform a feature transformation to obtain three different types of features regarding motion, structure, and hand positions. Through various experiments using two different datasets, KAD-30 and CAD-60, the high performance of our system is verified.

1. Introduction

Vision-based activity recognition has found many applications such as human-computer interaction [1, 2], surveillance [3, 4], robot learning [5, 6], and user interface design [7, 8]. Recently, many researchers have turned to depth cameras such as the Microsoft Kinect to detect human activities. Unlike conventional RGB cameras, Kinect-style depth cameras provide depth information in addition to the color of the target object. Depth information can be used to estimate a person's 3D body pose and to recognize his/her activities in real time. In this paper, we propose a system that can effectively recognize daily human activities with a Kinect-style depth camera. Our system utilizes a set of view-invariant features and the hidden state conditional random field (HCRF) [9, 10] model to recognize human activities from the dynamic body pose estimates provided by the MS Kinect API or OpenNI. Many high-level daily activities can be regarded as having a hierarchical structure, where multiple subactivities are performed sequentially or iteratively. Our system utilizes a multiclass HCRF model to effectively represent the hierarchical nature of such activities.

Many existing systems make use of only the 3D coordinates of individual body joints as the feature set for activity recognition. However, these joint coordinates can be easily affected by changes in the Kinect's viewpoint [11, 12]. In order to address the view variance problem and obtain more informative features, our system extracts joint angles from the subject's skeleton model and then performs a feature transformation to obtain three different types of features regarding motion, structure, and hand positions.

The remainder of this paper is structured as follows. In Section 2, we briefly review related work. Section 3 presents a comparison of several probabilistic graphical models, including the HMM, MEMM, CRF, and HCRF. Section 4 concentrates on the design of our activity recognition system. Section 5 presents the experiments conducted using two different datasets and the results obtained with our system. Finally, Section 6 summarizes our work and outlines future work.

2. Related Work

The most important factors affecting the performance of vision-based activity recognition systems are the set of features and the recognition model used to capture the unique characteristics of individual activities. Previous works adopt different features and models, resulting in distinct strengths and weaknesses in performance.

In Xia et al.'s work [13], histograms were extracted from the joint coordinates as features using modified spherical coordinate systems in order to overcome the view variance problem. However, for different activities that involve similar joint positions, the system could generate similar histograms, making it difficult to distinguish between such activities. In that work, activities are modeled with the Hidden Markov Model (HMM). The HMM is a widely used probabilistic graphical model for processing time-series data. However, this model has the limitation that current observations depend only on the current state, not on any previous states or observations. Moreover, it has a further limitation on training efficiency, since it requires supervised training to maximize the joint probability of the observation and state sequences. On the other hand, in Sung et al.'s work [14], joint angles are used as features instead of the corresponding joint coordinates to overcome the view variance problem. Hierarchical Maximum Entropy Markov Models (MEMMs) are adopted to model the hierarchical nature of activities as well as to enhance training efficiency. However, MEMMs are well known to suffer from the label bias problem.

In Zhang and Tian's study [15], spatiotemporal features and Support Vector Machines (SVMs) were used to represent activities. However, these features do not address the view variance problem, and SVMs are limited in learning human activity patterns over time compared with probabilistic graphical models. In Ong et al.'s work [16], features based on the human range of movement were extracted from joint poses, and k-means clustering, an unsupervised learning method, is applied to recognize daily activities. However, the features of this work are sensitive to camera view variance, and the range of motion of joints may vary from person to person. The system recognizes activities through k-means clustering without training a model. However, k-means clustering has several limitations: the number of clusters must be predetermined, and the resulting clusters may vary depending on the given initial clusters.

3. Probabilistic Graphical Models

Probabilistic graphical models [17] can be considered one of the best ways to represent the hierarchical structure of high-level daily activities, where multiple subactivities are performed sequentially or iteratively. Among the widely used probabilistic graphical models for activity recognition are the Hidden Markov Model (HMM), the Maximum Entropy Markov Model (MEMM), and the Conditional Random Field (CRF), as shown in Figures 1(a)–1(c), respectively.

The HMM in Figure 1(a) is a generative graphical model in which the target system to be modeled is assumed to be a Markov process. In the figure, the variables $x$, $s$, and $y$ represent the observation, the hidden state, and the class label, respectively. This model assumes that the conditional probability distribution of the hidden variable $s_t$ at time $t$ depends only on the value of the previous hidden variable $s_{t-1}$. Similarly, it assumes that the value of the observation variable $x_t$ depends only on the value of the hidden variable $s_t$. This means that the HMM presumes independence of the observations; therefore, this model cannot represent long-range dependencies among observations. Additionally, it has another limitation on training efficiency, since it requires supervised training to maximize the joint probability of the observation and state sequences.

The MEMM in Figure 1(b) is a discriminative graphical model that combines the features of the HMM and the Maximum Entropy (MaxEnt) model. An advantage of MEMM over HMM is that it provides increased freedom in choosing features to represent observations. Another advantage of MEMM over HMM is that training can be considerably more efficient. In MEMM, estimating the parameters of the maximum-entropy distributions used for the transition probabilities can be done for each transition distribution in isolation. However, the MEMM has a drawback that it potentially suffers from the label bias problem, in which states with low-entropy transition distributions effectively ignore their observations.

The CRF model in Figure 1(c) is a discriminative undirected graphical model. In the figure, $\mathbf{x}$ represents the observation sequence and $\mathbf{y}$ represents the random variable which, conditioned on $\mathbf{x}$, obeys the Markov property. The CRF model can contain any number of feature functions, and the feature functions can inspect the entire observation sequence $\mathbf{x}$. This means that the CRF model avoids the independence assumption between observations and allows nonlocal dependencies between states and observations [18]. Moreover, this model has no label bias problem, in contrast with the MEMM. However, the CRF model must assign a label to each time step and does not directly provide a way to estimate the conditional probability of a class label for an entire sequence.

The HCRF model shown in Figure 1(d) is a generalized CRF model with hidden states $\mathbf{h}$. It incorporates hidden state variables in a discriminative multiclass random field model. By allowing a classification model with hidden states, no a priori segmentation into substructures is needed, and labels at individual observations are optimally combined to form a class conditional estimate. As an extension of the CRF, this model can represent long-range dependencies among observations without the label bias problem. The HCRF model was introduced by Quattoni and Gunawardana and has been successfully applied to gesture recognition and phone classification [9, 10]. Due to these advantageous characteristics, we believe that the HCRF model can also be successfully applied to vision-based daily activity recognition.

4. Activity Recognition System

We design a system that can recognize high-level daily activities based on the 3D body pose data acquired from Microsoft's Kinect API. A high-level daily activity can be regarded as a hierarchical activity structure consisting of multiple subactivities that are performed sequentially or iteratively. For example, the activity of picking up an object on the floor consists of three successive subactivities: stooping down, grasping the object, and standing up, as described in Figure 2.

For our research, we collected training data for such high-level daily activities to construct the KAD-30 dataset. The KAD-30 dataset consists of 10 activities in total: opening a lid, drinking water, tying shoelaces, stretching, eating cereal, making a phone call, picking up an object on the floor, putting on and taking off a coat, wiping the floor, and writing on a whiteboard. The proposed activity recognition system consists of three steps: feature extraction, model learning, and activity recognition.

4.1. Feature Extraction

In this step, view-invariant features are extracted based on the 3D position data of 15 joints of the human body, including the head, neck, and torso, together with two sets of joint orientation data corresponding to the head and torso. As mentioned before, the set of 3D joint positions is directly provided by Microsoft's Kinect API, which estimates them from the depth images acquired by the Kinect sensor. However, the 3D position of each joint provided by the Kinect API is represented in a Cartesian coordinate system whose origin is at the center of the Kinect sensor. Thus, the 3D position data of a joint can change easily if either the Kinect sensor or the target subject changes its position. This means that the 3D joint coordinates directly acquired from the Kinect API are very sensitive to the Kinect's view variance, so they are not suitable features for distinguishing daily human activities robustly under various environmental conditions. Figure 3 illustrates the view variance problem. As shown in the figure, if the Kinect's view changes, the position value of the same elbow joint captured by the Kinect sensor also changes. In order to address the view variance problem and obtain more informative features, our system extracts joint angles from the subject's skeleton model and then performs a feature transformation to obtain three different types of features regarding motion, structure, and hand positions.

While performing a daily activity, each joint of the performer moves according to a specific pattern over time. These temporal patterns of joint movement can be effectively captured by motion features. In addition, daily activities are performed through multiple interactions between distinct joints. For example, picking up an object on the floor is mainly accomplished through the interaction between the knee and hand joints. We capture these spatial patterns through structure features. Finally, many daily activities involve hand movement; unlike other animals, humans use their hands extensively in daily life, for example, when drinking water or opening the lid of a container. Hand position features, which represent the positions of both hands relative to the head and the torso, help distinguish daily activities that involve the hands.

Figure 4 illustrates the process of extracting the motion and the structure features. As shown in Figure 4, the 3D Cartesian coordinates $(x, y, z)$ of each joint are first transformed into 2D spherical coordinates $(\theta, \phi)$, where $\theta$ is the polar angle and $\phi$ is the azimuthal angle of the joint. The following equation shows how to compute the polar and the azimuthal angles from the corresponding 3D joint coordinates $(x, y, z)$. In the equation, $r$ is the radial distance, which is the Euclidean distance from the origin to the joint. In our work, the radial distance is omitted and only the polar and the azimuthal angles are used to extract features in the subsequent processes:

$$r = \sqrt{x^2 + y^2 + z^2}, \qquad \theta = \arccos\left(\frac{z}{r}\right), \qquad \phi = \arctan\left(\frac{y}{x}\right). \tag{1}$$
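As a concrete illustration, the following Python sketch (not the authors' original C++/MATLAB implementation) converts a joint's Cartesian coordinates into the polar and azimuthal angles defined above; np.arctan2 is used instead of a plain arctangent only to keep the azimuth well defined in all quadrants.

```python
import numpy as np

def to_spherical(joint_xyz):
    """Convert a 3D joint position (x, y, z) into spherical angles (theta, phi).

    theta : polar angle, measured from the +z axis
    phi   : azimuthal angle in the x-y plane
    The radial distance r is computed but discarded, as described in the text.
    """
    x, y, z = joint_xyz
    r = np.sqrt(x * x + y * y + z * z)            # Euclidean distance to the origin
    theta = np.arccos(z / r) if r > 0 else 0.0    # polar angle
    phi = np.arctan2(y, x)                        # azimuthal angle, quadrant-safe
    return theta, phi

# Example: a hypothetical elbow-joint position in Kinect camera coordinates (meters)
theta, phi = to_spherical((0.35, 0.80, 2.10))
```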

From the transformed 2D spherical coordinates $(\theta_j^t, \phi_j^t)$ of each joint $j$ at frame $t$, the motion features $M_j^t$ and the structure features $S_{j,k}^t$ are calculated through the following equations. Below, $t$ and $j$ (or $k$) refer to the frame and joint indexes, respectively:

$$M_j^t = \left(\theta_j^t - \theta_j^{t-1},\; \phi_j^t - \phi_j^{t-1}\right), \qquad S_{j,k}^t = \left(\theta_j^t - \theta_k^t,\; \phi_j^t - \phi_k^t\right). \tag{2}$$

The motion features $M_j^t$ of joint $j$ are obtained from the $t$-th input frame by computing the difference between the current and the previous spherical position of the joint. Hence, the motion features represent the positional change of each joint from the $(t-1)$-th frame to the $t$-th frame. On the other hand, the structure features $S_{j,k}^t$ of joint $j$ are extracted from the $t$-th input frame by computing the difference between the current position of joint $j$ and the current position of another joint $k$. For example, if joint $j$ is the center of the head, joint $k$ can be one of the other joints, such as the neck or the torso. Hence, the structure features represent the relative position of joint $j$ with respect to joint $k$ at the $t$-th frame. It is assumed that the position of each joint at frame $t$ has already been transformed into 2D spherical coordinates in the aforementioned way.
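The sketch below illustrates these two feature types for a sequence of joint angles; the array layout (frames × joints × 2 angles) and the particular reference-joint pairs are assumptions made for illustration, not the paper's exact implementation.

```python
import numpy as np

def motion_features(angles):
    """angles: array of shape (T, J, 2) holding (theta, phi) per frame and joint.
    Returns frame-to-frame angle differences of shape (T-1, J, 2)."""
    return angles[1:] - angles[:-1]

def structure_features(angles, pairs):
    """pairs: list of (j, k) joint-index pairs to compare within each frame.
    Returns per-frame angle differences of shape (T, len(pairs), 2)."""
    return np.stack([angles[:, j] - angles[:, k] for j, k in pairs], axis=1)

# Example: 100 frames, 15 joints, hypothetical pairs (head-neck, head-torso)
angles = np.random.rand(100, 15, 2)
m = motion_features(angles)                        # shape (99, 15, 2)
s = structure_features(angles, [(0, 1), (0, 2)])   # shape (100, 2, 2)
```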

Figure 5 describes the process of extracting the hand position features. The position features of each hand are obtained by computing its relative positions with respect to both the head and the torso. For example, while the relative position features $F_{LH}^{H}$ of the left hand with respect to the head are computed through (3), its relative position features $F_{LH}^{T}$ with respect to the torso are calculated through (5). Similarly, the relative position features $F_{RH}^{H}$ and $F_{RH}^{T}$ of the right hand are computed through (4) and (6), respectively. In the equations, $P_{LH}$, $P_{RH}$, $P_{H}$, and $P_{T}$ represent the 3D position vectors of the left hand, the right hand, the head, and the torso, respectively. On the other hand, $R_{H}$ and $R_{T}$ are the orientation matrices of the head and the torso, respectively:

$$F_{LH}^{H} = R_{H}^{-1}\left(P_{LH} - P_{H}\right), \tag{3}$$
$$F_{RH}^{H} = R_{H}^{-1}\left(P_{RH} - P_{H}\right), \tag{4}$$
$$F_{LH}^{T} = R_{T}^{-1}\left(P_{LH} - P_{T}\right), \tag{5}$$
$$F_{RH}^{T} = R_{T}^{-1}\left(P_{RH} - P_{T}\right). \tag{6}$$
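A minimal sketch of this computation follows; it assumes each orientation is given as a 3×3 rotation matrix, so expressing a hand position in the head or torso frame amounts to subtracting the reference position and rotating by the transpose (inverse) of the orientation matrix.

```python
import numpy as np

def relative_hand_position(p_hand, p_ref, r_ref):
    """Express a hand position in the local frame of a reference joint.

    p_hand : (3,) world-frame hand position
    p_ref  : (3,) world-frame position of the reference joint (head or torso)
    r_ref  : (3, 3) orientation (rotation) matrix of the reference joint
    """
    return r_ref.T @ (p_hand - p_ref)    # R^{-1} == R^T for rotation matrices

# Hypothetical example values for one frame
p_left_hand = np.array([0.30, 0.90, 2.10])
p_head      = np.array([0.00, 1.50, 2.00])
r_head      = np.eye(3)                  # identity orientation, for illustration
f_lh_head = relative_hand_position(p_left_hand, p_head, r_head)
```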

In general, the higher the dimensionality of the feature vectors, the higher the computational complexity required for model learning and activity recognition. The feature vectors produced by the feature extraction process have 252 dimensions. Vector quantization is therefore performed by applying k-means clustering to the high-dimensional feature vectors to increase the efficiency of model learning and activity recognition. Through vector quantization, each high-dimensional feature vector is replaced with an integer index indicating the cluster to which it belongs. As a result, a one-dimensional integer time series is generated for each performed activity. Because the length of this time series is determined by how long the activity takes to perform, its length differs from activity to activity. The subsequent steps of the proposed system, model learning and activity recognition, use these time-series feature data for model training and testing.
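A hedged sketch of this quantization step using scikit-learn's KMeans is shown below; the codebook size of 64 clusters is an assumed value for illustration, since the paper does not state it here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training matrix: one 252-dimensional feature vector per frame.
features = np.random.rand(5000, 252)

# Learn a codebook; 64 clusters is an assumed codebook size, not the paper's value.
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(features)

# Each frame's feature vector is replaced with the index of its nearest cluster,
# turning an activity clip into a 1D integer time series of variable length.
def quantize(sequence, kmeans_model):
    return kmeans_model.predict(sequence)

clip = np.random.rand(120, 252)           # one activity clip (120 frames)
symbol_sequence = quantize(clip, kmeans)  # shape (120,), integer cluster indexes
```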

4.2. Model Learning

As mentioned before, many high-level daily activities can be regarded as having a hierarchical structure, where multiple subactivities are performed sequentially or iteratively. Our system utilizes the hidden state conditional random field (HCRF) model to effectively represent the hierarchical nature of such activities. In order to recognize a number of activities with a single trained model, our system uses a multiclass HCRF model. A state variable in this HCRF model represents a subactivity belonging to a high-level activity and is assumed to be hidden; therefore, there is no need to designate a label for each subactivity in the training data.

Figure 6 shows the process of learning the optimized parameters $\theta^{*}$ of the HCRF model. The parameter vector $\theta$ is made up of three different components: $\theta_{s}$, $\theta_{y}$, and $\theta_{e}$. $\theta_{s}$ refers to the parameters corresponding to a hidden state $h$. Similarly, $\theta_{y}$ stands for the parameters corresponding to a class label $y$ and a hidden state $h$. $\theta_{e}$ refers to the parameters corresponding to a class label $y$ and a pair of hidden states $h$ and $h'$. In order to learn the optimized parameters $\theta^{*}$ from the initial parameters $\theta^{0}$, training data of the form $\{(\mathbf{x}_i, y_i)\}$ are used, where $\mathbf{x}_i$ is an observation sequence and $y_i$ is the label of its activity class.

In the model learning process, the optimized parameters $\theta^{*}$ are searched for so as to maximize the objective function $L(\theta)$ on the training dataset. The first term of the objective function is the conditional log-likelihood $\log P(y \mid \mathbf{x}; \theta)$. The conditional probability of a class label $y$ given an observation sequence $\mathbf{x}$ is defined as in the following equation:

$$P(y \mid \mathbf{x}; \theta) = \frac{\sum_{\mathbf{h}} \exp \Psi(y, \mathbf{h}, \mathbf{x}; \theta)}{\sum_{y'} \sum_{\mathbf{h}} \exp \Psi(y', \mathbf{h}, \mathbf{x}; \theta)}. \tag{7}$$

The objective function depends on the potential function $\Psi(y, \mathbf{h}, \mathbf{x}; \theta)$, parameterized by $\theta$, which measures the compatibility among a class label, a set of observations, and a configuration of the hidden states. Using a gradient ascent method, the optimized parameters $\theta^{*}$ are found that maximize the objective function $L(\theta)$, as in the following equation:

$$\theta^{*} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \left( \sum_{i} \log P(y_i \mid \mathbf{x}_i; \theta) - \frac{\lVert \theta \rVert^{2}}{2\sigma^{2}} \right). \tag{8}$$
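To make the class posterior in (7) concrete, the following Python sketch computes $P(y \mid \mathbf{x})$ for a chain-structured HCRF with a forward recursion over hidden states. The specific factorization of the potential into node scores ($\theta_s$, $\theta_y$) and class-specific transition scores ($\theta_e$) is an assumption consistent with the model description above, not the authors' exact code.

```python
import numpy as np

def hcrf_class_posterior(x, theta_s, theta_y, theta_e):
    """Class posterior P(y | x) for a chain-structured HCRF.

    x       : (T, D) observation sequence (feature vectors per frame)
    theta_s : (H, D) weights linking observations to hidden states
    theta_y : (Y, H) weights linking class labels to hidden states
    theta_e : (Y, H, H) class-specific transition weights between hidden states
    """
    Y, H = theta_y.shape
    log_z = np.empty(Y)                        # per-class log partition value
    for y in range(Y):
        node = x @ theta_s.T + theta_y[y]      # (T, H) node potentials
        alpha = node[0]                        # forward messages in log space
        for t in range(1, x.shape[0]):
            scores = alpha[:, None] + theta_e[y] + node[t][None, :]
            alpha = np.logaddexp.reduce(scores, axis=0)
        log_z[y] = np.logaddexp.reduce(alpha)
    return np.exp(log_z - np.logaddexp.reduce(log_z))   # softmax over classes
```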

The number of hidden states and the size of the history are determined in advance in order to train the HCRF model. In our system, the number of hidden states of the HCRF model is set to 7, considering the complexity of the target activities. The history size, which determines the dependency range, is set to 1. As the optimization method used to adjust the feature weights of the HCRF model, the Limited-Memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm is used.
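The following sketch shows one way such training could be wired up with SciPy's L-BFGS-B optimizer, reusing the hcrf_class_posterior function sketched above. The number of hidden states (7) follows the text, while the feature dimension, class count, regularization weight, and the use of finite-difference gradients are simplifying assumptions rather than the authors' setup.

```python
import numpy as np
from scipy.optimize import minimize

H, D, Y = 7, 64, 10            # hidden states (from the text); feature dim and classes assumed
sizes = [H * D, Y * H, Y * H * H]

def unpack(flat):
    a, b, c = np.split(flat, np.cumsum(sizes)[:-1])
    return a.reshape(H, D), b.reshape(Y, H), c.reshape(Y, H, H)

def neg_log_likelihood(flat, data, reg=1e-2):
    """data: list of (x, y) pairs; x is e.g. one-hot encodings of quantized symbols."""
    theta_s, theta_y, theta_e = unpack(flat)
    nll = reg * np.sum(flat ** 2)                       # L2 regularization term
    for x, y in data:
        nll -= np.log(hcrf_class_posterior(x, theta_s, theta_y, theta_e)[y])
    return nll

# L-BFGS-B with finite-difference gradients (slow, for illustration only):
# result = minimize(neg_log_likelihood, np.zeros(sum(sizes)), args=(train_data,),
#                   method="L-BFGS-B", options={"maxiter": 50})
```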

4.3. Activity Recognition

In the activity recognition step, the conditional probability of each activity, $P(y \mid \mathbf{x}; \theta^{*})$, is calculated using the trained HCRF model and the test sequence data $\mathbf{x}$. The test data is then recognized as the activity $y^{*}$ with the highest conditional probability, as in the following equation:

$$y^{*} = \arg\max_{y} P(y \mid \mathbf{x}; \theta^{*}). \tag{9}$$
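Continuing the sketches above, recognition then reduces to taking the argmax of the class posterior for a test clip; this helper assumes the hcrf_class_posterior function and trained parameters from the earlier sketches, and the variable names are illustrative.

```python
import numpy as np

def recognize(test_clip, theta_s, theta_y, theta_e, activity_names):
    """Return the activity name with the highest HCRF posterior for one test clip."""
    posterior = hcrf_class_posterior(test_clip, theta_s, theta_y, theta_e)
    return activity_names[int(np.argmax(posterior))]
```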

5. Performance Evaluation

Based on the design described above, our activity recognition system was implemented using C++ and MATLAB on Windows 7. Several experiments were conducted to evaluate the performance of the proposed activity recognition system. In the experiments, two different datasets were used: the KAD-30 dataset from Kyonggi University and the CAD-60 dataset from Cornell University. Figure 7 shows the 10 common daily activities included in the KAD-30 dataset. The activities in the KAD-30 dataset are opening a lid, drinking water, tying shoelaces, stretching, eating cereal, making a phone call, picking up an object on the floor, putting on and taking off a coat, wiping the floor, and writing on a whiteboard. To collect the KAD-30 dataset, 3 different subjects performed the 10 activities ten times each in front of the Kinect sensor. The 3D body pose data for each activity were recorded for 30 to 40 seconds at 30 frames per second.

Figure 8 shows 12 daily human activities in the CAD-60 dataset provided by Cornell University. The activities included in the CAD-60 dataset are brushing teeth, cooking (stirring), writing on a whiteboard, working on computer, talking on the phone, wearing contact lens, relaxing on couch, opening pill container, drinking water, cooking (chopping), talking on couch, and rinsing the mouth.

To analyze the performance of our activity recognition system, three different experiments were conducted using the KAD-30 and CAD-60 datasets. In the first experiment, we compared the recognition performance of two different HCRF models: the one-versus-all HCRF model and the multiclass HCRF model. A one-versus-all HCRF model can distinguish only one activity from the others, so in order to recognize $N$ different activities, a total of $N$ one-versus-all HCRF models need to be learned. On the other hand, a single multiclass HCRF model can be learned to recognize all $N$ activities. In addition, we conducted the experiment with different history sizes to analyze the effect of long-range dependencies, setting the history size to 0 for one model and to 1 for the other.

Table 1 summarizes the results of the experiment comparing the recognition performance of the one-versus-all HCRF model and the multiclass HCRF model. The multiclass HCRF model performs better than the one-versus-all HCRF model. The performance of both HCRF models improved significantly when the history size was increased, which indicates that incorporating long-range dependencies was useful.

In the second experiment, we analyzed the per-activity recognition performance of the multiclass HCRF model. For this experiment, we set the history size of the multiclass HCRF model to 1. Figure 9 shows the two confusion matrices for the KAD-30 and CAD-60 datasets obtained in this experiment. In the case of the KAD-30 dataset, the activity of writing on the whiteboard showed the lowest recognition accuracy. This was because the hands of the target subject were often hidden by his/her torso while writing on the board. For the CAD-60 dataset, the opening pill container and wearing contact lens activities showed lower recognition accuracies than the other activities. This was due to insufficient available information, as these activities take a shorter time to perform than the others.

In the third experiment, we compared the recognition performance of three different probabilistic graphical models: the HMM, the CRF, and the multiclass HCRF. Due to their inherent assumptions and structures, these models have different expressive power, so we expect activity recognition with different models to yield different performance. Table 2 summarizes the results of this experiment. Our multiclass HCRF model with the history size set to one performed better than the HMM, the CRF, and even the multiclass HCRF model with the history size set to zero. The HMM performed better than the CRF model on both the KAD-30 and the CAD-60 datasets. In this experiment, hidden state models such as the HMM and the HCRF performed better than non-hidden-state models like the CRF. This result implies that hidden state models are very effective in learning the hierarchical structure of high-level human activities. We also found that the CRF and the multiclass HCRF models improved somewhat when the history size was increased. This result indicates the beneficial effect of long-range dependencies in the CRF and HCRF models.

6. Conclusions

In this paper, we proposed a daily activity recognition system that applies the multiclass HCRF model to Kinect sensor data. The HCRF model is used to effectively represent the hierarchical structure of high-level daily activities. In addition, the proposed system extracts three kinds of view-invariant features from the 3D joint coordinates provided by the Kinect API. These features represent various characteristics of high-level daily activities, including the movement pattern of each joint over time, the structural relationship between two different joints at a given instant, and the relative positions of both hands. Through experiments using the KAD-30 dataset from Kyonggi University and the CAD-60 dataset from Cornell University, the high recognition performance of the proposed system was verified.

In the future, our research will focus on the following points. On the one hand, we will optimize our system to further improve its performance. On the other hand, we will extend our system to useful applications such as home healthcare, human-robot interaction (HRI), and other context-aware services.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the GRRC Program of Gyeonggi province.