Abstract

This paper proposes a novel approach that decomposes a two-person interaction into a Positive Action and a Negative Action for more efficient behavior recognition. The Positive Action plays the decisive role in a two-person exchange; interaction recognition can therefore be simplified to Positive Action-based recognition, focusing on the action representation of just one person. Recently, a new depth sensor, the Microsoft Kinect camera, has become widely available; it provides RGB-D data with 3D spatial information for quantitative analysis. However, there are few publicly accessible test datasets captured with this camera for assessing two-person interaction recognition approaches. Therefore, we created a new dataset, named K3HI, with six types of complex human interactions: kicking, pointing, punching, pushing, exchanging an object, and shaking hands. Three types of features were extracted for each Positive Action: joint, plane, and velocity features. We used continuous Hidden Markov Models (HMMs) to evaluate the Positive Action-based interaction recognition method and the traditional two-person interaction recognition approach on our test dataset. Experimental results showed that the proposed recognition technique is more accurate than the traditional method and shortens the sample training time, making it the better choice overall.

1. Introduction

Over the last few decades, human activity analysis has undergone rapid development and received increasing attention in many fields, such as intelligent surveillance, human-computer interaction, and elder care management [1, 2]. Human activity can be categorized according to complexity as partial body action [3], simple action [4], interaction activity [5, 6], or group activity [7]. Motivated by the activity classes drawn from [5, 6], this paper focuses on two-person interaction recognition of six complex interactions: kicking, pointing, pushing, punching, exchanging an object, and shaking hands.

Much research has been done on two-person interactions [5–10] with respect to the kinds of complex action relationships and human features necessary for recognition. For example, [5] took into account whether one person’s hand is above another’s shoulder or whether one person’s foot is near another’s torso. Reference [6] used head-pose, arm-pose, leg-pose, and overall body-pose estimation of both people for recognition. However, these processes are complex and time-consuming, and the recognition results might not be as accurate as required for a particular application. This paper proposes a new definition based on one person’s behavior, called the Positive Action. In this view, one person’s action plays the key role in an interaction; thus, two-person interaction recognition can be simplified into Positive Action recognition. This approach is simpler than traditional methods, saves computing time, and improves recognition results.

The recent proliferation of a cheap but effective depth sensor, the Microsoft Kinect [11], has created more opportunities for quantitative analysis of complex human activities. As compared to the traditional video camera, Kinect has the advantage of synchronous acquisition of color and depth images; with the use of depth maps, 3D information about a scene from a particular point of view is easily computed under diverse conditions [12]. This in turn will make behavior detection easier in badly lit or dark places. For example, Figure 1(a) represents a depth image captured by Kinect in weak light, which clearly shows one person punching at another; Figure 1(b) shows a color image of this interaction synchronously captured with the depth image. With a traditional camera, only RGB images as seen in Figure 1(b) are collected, with limited value for surveillance and other applications. Unfortunately, there are few publicly accessible test datasets to assess two-person interaction recognition approaches using the depth sensor. Thus, we created a new dataset for two-person interaction. The first version of this original dataset is available to download on the Internet at http://www.lmars.whu.edu.cn/prof_web/zhuxinyan/DataSetPublish/dataset.html.

The Microsoft Kinect sensor produces a new type of data, RGB-D data, which is an improvement on RGB images for human behavior recognition research. Therefore, many researchers have collected their own data, and some of these datasets are publicly accessible on the Internet [13–15]. In [16], Sung et al. produced a dataset including a total of twelve unique activities in five realistic domestic environments: office, kitchen, bedroom, bathroom, and living room. The RGBD-HuDaAct video database [17], collected in a lab environment, includes 12 categories of human daily activities: making a phone call, mopping the floor, entering a room, and so forth. The LIRIS human activity dataset contains (gray/RGB/depth) videos showing people performing various activities taken from daily life (discussing, making telephone calls, exchanging an item, etc.); it includes information on not only the action class but also the spatial and temporal positions of objects in the video. However, these datasets only address individual activities and not two-person interactions [18].

Several datasets involving more than one person have been created using Kinect. In [19], the UT Kinect-human detection dataset was created: it contains 98 frames with two people appearing in the scene at different depths in a variety of poses, including several simple interactions. In addition, [5] chose eight types of two-person interactions to establish another two-person dataset, including approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands. However, this latter dataset is not publicly available on the Internet.

Depth imaging data produced by the Kinect sensor is driving new research on single-person and daily activity recognition. For human activity or behavior representation, the methods in [16, 20] detected and recognized different activities through body-pose features, hand position features, and motion information, using the Kinect sensor. In [17], Ni et al. proposed depth-extended feature representation methods to obtain superior recognition performance based on the RGBD-HuDaAct dataset. Nowozin and Shotton [21] used skeletal features (joint velocities, joint angles, and joint angle velocities) to reduce the latency in recognizing an action.

For human activity or behavior recognition, most efforts use HMM-based approaches. Park and Aggarwal [6] used HMMs for human motion recognition and combined them hierarchically using DBNs (Dynamic Bayesian Networks). Vogler and Metaxas [22] presented parallel HMMs to recognize American Sign Language based on magnetic tracking data, while Wilson and Bobick [23] proposed parametric HMMs to recognize human gestures. HMM-based recognition of more complex sequences is addressed in [24–26]. The method proposed in [24] was able to recognize motion units from optical flow data; in [25], Li proposed an approach based on landmark point trajectories to recognize view-invariant human actions, and Chen et al. [26] presented a star skeleton model to recognize a single action and a series of actions.

Presently, there is little human interaction research based on Microsoft Kinect data, and few papers report a complex human activity dataset created to depict two-person interactions [5]. That work concluded that activity recognition using geometric relational features based on the distance between all pairs of joints outperforms other feature choices. Our proposed approach and test dataset extend this research.

The contribution of this paper is twofold: we developed an efficient approach based on Positive Action representation to recognize two-person interactions, and we created a new dataset based on the Kinect sensor to test and verify methods. The rest of this paper is organized as follows. Section 2 describes our interaction dataset; Section 3 details the Positive Action definition and feature extraction method; Section 4 presents the Positive Action-based and the traditional interaction recognition methods via HMMs; Section 5 demonstrates experimental results from the two approaches on our test dataset; finally, Section 6 concludes this paper and discusses future work.

2. K3HI: Kinect-Based 3D Human Interaction Dataset

We collected two-person interactions using a Microsoft Kinect sensor. All videos were recorded in an indoor room while 15 volunteers performed activities. Each pair of people performed all types of interactions. The dataset has a total of approximately 320 interactions organized into eight categories. The first version of this dataset has been made publicly available to the research community to encourage progress in human action studies based on this new technology (http://www.lmars.whu.edu.cn/prof_web/zhuxinyan/DataSetPublish/dataset.html). Since approaching and departing activities are simple and recognition accuracy for both interactions was almost 100% [5, 6], we chose the other, relatively complex types of two-person interactions for the recognition studies.

The most important data in our dataset is the spatial information (3D coordinates) of the two persons’ skeletons. In order to ensure the integrity and continuity of the target data, the original RGB images and depth information were ignored when capturing the data. An articulated skeleton for each person was extracted using the OpenNI software [27] and the Natural Interaction (NITE) Middleware provided by PrimeSense [28]. A skeleton is represented by the 3D positions of 15 joints: head, neck, left shoulder, right shoulder, left elbow, right elbow, left hand, right hand, torso center, left hip, right hip, left knee, right knee, left foot, and right foot. However, when two persons overlapped, especially in a hugging activity (e.g., see Figure 2), full body tracking with the NITE Middleware could be inaccurate. Inaccurate or lost tracking seriously affects the results, so hugging was not considered in our dataset. In the end, six types of two-person interactions were captured: kicking, punching, pointing, pushing, exchanging an object, and shaking hands. Figure 3 visualizes the collected interaction data in the form of skeletons, with different colors representing different actors.
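For readers who want to work with the skeleton streams programmatically, the sketch below shows one way to represent a tracked frame. The 15-joint order follows the list above; the flat value layout and the helper name are our own illustrative assumptions, since the dataset's exact on-disk format is not described in this section.

```python
import numpy as np

# The 15 NITE joints in the order listed above. The index convention is ours,
# not an official part of the dataset specification.
JOINTS = [
    "head", "neck", "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow", "left_hand", "right_hand",
    "torso", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_foot", "right_foot",
]

def parse_frame(values):
    """Turn a flat list of 2 * 15 * 3 floats (person one's joints followed by
    person two's, each joint as x, y, z) into two (15, 3) arrays.
    The flat layout is assumed for illustration only."""
    coords = np.asarray(values, dtype=float).reshape(2, len(JOINTS), 3)
    return coords[0], coords[1]
```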

3. Positive Action Representation

3.1. Positive Action Definition

Most existing work on human interactions focuses on two people, considering what kind of action relationship they have and what kind of features should be chosen to best represent an interaction [5, 6, 10–12]. Interactions can be classified into two groups: in the first group, one person acts first and the other person gives a responsive action, for example, kicking, pointing, punching, pushing, and so forth; in the second group, both people perform an almost identical synchronous action, for example, exchanging an object, shaking hands, and so forth. We propose that an interaction can be decomposed into a Positive Action and a Negative Action. For interactions in the first group, the person who acts first, prompting the other person’s reaction, performs the Positive Action. In the second group, since both people’s behavior is similar and synchronized, we simply define the action with the greater position changes in the first few frames as the Positive Action. In all cases, a Negative Action is defined as the reciprocal action corresponding to a Positive Action in a two-person interaction.

After the Positive Action is identified, complex interaction recognition becomes relatively easy. Figures 4(a)–4(f) represent the original two-person interactions that were tested in [6], while Figures 4(a′)–4(f′) show the simplified results in which the complex interactions are reduced to Positive Action-based representations. It can be seen that the Positive Actions are clearly distinguishable from one another; therefore, only one person’s features need to be taken into account, and traditional interaction recognition can be transformed into Positive Action recognition.

3.2. Positive Action Extraction

Next, we obtained the Positive Actions in our dataset by means of mathematical analysis, especially for interactions in the first group as defined in Section 3.1. The window size for each interaction was approximately 25 frames. We kept only the first ten frames, since the action changes in the first few frames are sufficient to distinguish the Positive Action from the Negative Action. The extraction process for the Positive Action is divided into the following three procedures.

(1) Aligning the Sequence. For an interaction activity, there are always variances in time or frame length when capturing the data. Before discerning a Positive Action, we first select the interactions of the same class and align their sequences. The Dynamic Time Warping (DTW) model is used to align the sequences of the same activity class, as mentioned in [29]. For each class, we selected a standard interaction sequence suitable for representing the interaction process. We then separately computed the minimal DTW distance between each remaining interaction sequence and the standard interaction sequence of the same class to find the optimal alignment.

In the DTW process, we express the feature vectors of two different sequences (in the same interaction class) as two time series (or frame series) $X$ and $Y$, defined as follows:

$$X = \{x_1, x_2, \ldots, x_m\}, \tag{1}$$
$$Y = \{y_1, y_2, \ldots, y_n\}, \tag{2}$$

where $x_i$ and $y_j$ are the per-frame feature vectors and $m$ and $n$ are the numbers of frames in the two sequences.

Accordingly, the cost between two series is lower when they are similar; that is, if two sequences are well aligned, their DTW distance is small. The cumulative distance is computed recursively as

$$D(i, j) = c(x_i, y_j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\}, \tag{3}$$

and the minimal DTW distance is defined as

$$\mathrm{DTW}(X, Y) = D(m, n), \tag{4}$$

where $c(x_i, y_j)$ is the feature distance between the frames at time $t_i$ and time $t_j$ in the two sequences $X$ and $Y$.

Each activity sequence contains two persons’ 3D joint positions, represented as $P^1 = \{p^1_{t,j}\}$ and $P^2 = \{p^2_{t,j}\}$, where $P^1$ and $P^2$ are the position sets of the first and the second person, respectively, and $t$ and $j$ are the frame index and the joint index. We used the joint positions to characterize the feature of each frame for the distance computation between $X$ and $Y$. The distance is the Euclidean distance $c(x_i, y_j) = \lVert x_i - y_j \rVert$ between the frame features at time $t_i$ and time $t_j$. Then, we placed this Euclidean distance into formula (4) to obtain the minimal DTW distance, finding the optimal alignment between variable-length interaction sequences.
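As a concrete illustration of procedure (1), the following sketch computes the minimal DTW distance between an interaction sequence and the standard sequence of its class with a plain dynamic-programming recursion; the per-frame features are simply the concatenated joint coordinates, and the code is our own reading of the procedure rather than the exact implementation used in the experiments.

```python
import numpy as np

def dtw_distance(X, Y):
    """Minimal DTW distance (formula (4)) between two sequences of per-frame
    feature vectors; X has shape (m, d), Y has shape (n, d)."""
    m, n = len(X), len(Y)
    D = np.full((m + 1, n + 1), np.inf)   # cumulative distance matrix
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])   # Euclidean frame cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

def distances_to_standard(standard, sequences):
    """DTW distance from each remaining sequence of a class to its standard
    sequence, used to find the optimal alignment within the class."""
    return [dtw_distance(standard, seq) for seq in sequences]
```

If a frame-level correspondence is needed rather than the distance alone, the warping path can be recovered by backtracking through the cumulative matrix D.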

(2) Computing Key Joint Position Changes. We selected eight joints as key joints, which represent changes in the body’s motion; these joints include the left and right elbow, left and right hand, left and right knee, and left and right foot.

The position changes of the joints were described by calculating the distances between neighboring frames, defined as follows:

$$d_j(t) = \lVert p_j(t+1) - p_j(t) \rVert = \sqrt{(x_{t+1} - x_t)^2 + (y_{t+1} - y_t)^2 + (z_{t+1} - z_t)^2}, \tag{5}$$

where $d_j(t)$ is the Euclidean distance of key joint $j$ between frame $t$ and frame $t+1$, $p_j(t) = (x_t, y_t, z_t)$ indicates the position of joint $j$ at frame $t$, and $x$, $y$, and $z$ are the 3D coordinates.
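A minimal sketch of procedure (2), assuming each person's pose is stored as a (frames, 15, 3) array and that the key-joint indices follow our joint ordering from the earlier snippet:

```python
import numpy as np

# Indices of the eight key joints in our ordering: elbows, hands, knees, feet.
KEY_JOINTS = [4, 5, 6, 7, 11, 12, 13, 14]

def key_joint_changes(pose):
    """Per-frame position change of each key joint, following formula (5).
    pose: array of shape (T, 15, 3); returns an array of shape (T - 1, 8)."""
    deltas = pose[1:, KEY_JOINTS, :] - pose[:-1, KEY_JOINTS, :]
    return np.linalg.norm(deltas, axis=2)
```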

(3) Identifying Positive Action. For actions in the first group defined in Section 3.1, extracting the Positive Action is harder than for the second group. According to the benchmark in [30], human reaction time is around 0.2-0.3 s. Our data was collected at 15 frames per second, so the reaction time corresponds to 3-4 frames. This means that in the first group of interactions, about 3-4 frames after a Positive Action starts, the corresponding Negative Action occurs.

In our Positive Action definition, because the joint position changes in the first few adjacent frames conform to this benchmark, we can compare the maximum position changes of both persons’ key joints between the initial frame and frame $t_k$ of a sequence. The value of $t_k$ for the standard interaction sequence mentioned in procedure (1) is one; for the other sequences, $t_k$ takes a different value after DTW processing. This is expressed as follows:

$$M^1 = \max_{j,\ t \le t_k} d^1_j(t), \qquad M^2 = \max_{j,\ t \le t_k} d^2_j(t), \tag{6}$$

where $M^1$ and $M^2$ indicate the maximum position changes of the key joints of person one and person two in an interaction. If $M^1 > M^2$, person one’s action is the Positive Action and person two’s is the Negative Action; otherwise, person two’s action is the Positive Action. Figure 6 shows the processing results for Positive Actions, ignoring the Negative Actions. Each action has its own distinct characteristics, even for easily confused interactions such as exchanging an object and shaking hands.

Positive Action extraction is much easier in the second group than in the first. According to the definition of the Positive Action for group two, we also use (6); the person with the larger maximum position change performs the Positive Action.
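Procedure (3) then reduces to comparing the two persons' maximum key-joint changes over the first few aligned frames, along the lines of formula (6). The sketch below reuses key_joint_changes from the previous snippet and leaves the frame index t_k as an input.

```python
def positive_actor(pose1, pose2, t_k=1):
    """Return 0 if person one performs the Positive Action, 1 otherwise.
    pose1, pose2: arrays of shape (T, 15, 3); t_k: number of leading frame
    transitions to inspect (1 for the standard sequence, see formula (6))."""
    m1 = key_joint_changes(pose1)[:t_k].max()
    m2 = key_joint_changes(pose2)[:t_k].max()
    return 0 if m1 > m2 else 1
```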

To verify the Positive Action extraction method, we selected the “kicking” action from the first group of interactions and “shaking hands” from the second group and calculated the position changes using (5) for the first 10 frames. Figure 5 shows the results. From Figure 5(a), it can be seen that as person one’s right foot and right knee positions change from the first frame to the third frame, person two’s left and right elbow and hand positions change in the fourth frame. These changes suggest that when person one starts to kick, person two’s upper limbs react a few frames later, so the first person’s motion is the Positive Action. In contrast, Figure 5(b) does not show any such causal connection between the two behaviors; both persons’ right hands and elbows simply move in a synchronized fashion. In general, the experimental results support our Positive Action extraction method.

The visualization of Positive Actions is shown in Figure 6. Table 1 presents the Positive Action extraction results with and without DTW for the first group, illustrating that extraction is more accurate after DTW preprocessing.

3.3. Feature Extraction

After Positive Actions are extracted, we utilize several body-pose features for motion-capture data representation and evaluate these features using our test dataset. One of the biggest challenges when using skeleton joints as a feature is that semantically similar motions may not necessarily be numerically similar [31]. To overcome this, [32] used relational body-pose features as introduced in [31], describing geometric relations between specific joints in a single pose or a short sequence of poses. Relational pose features were used to recognize daily-life activities performed by a single actor in a random forest framework; the features included joint, plane, and velocity features.

(i) Joint Features

Joint Distance. Let $p_j(t)$ be the 3D location of joint $j$ in a Positive Action at time $t$. The joint distance feature is defined as the Euclidean distance between two joints at time $t$ and is represented as

$$F_{jd}(j_1, j_2, t) = \lVert p_{j_1}(t) - p_{j_2}(t) \rVert, \tag{7}$$

where $j_1$ and $j_2$ are any two joints of a single person ($j_1 \ne j_2$).

Joint Motion. Similar to the joint distance feature, the joint motion feature is defined as the Euclidean distance between joint $j_1$ at time $t_1$ and joint $j_2$ at time $t_2$. It captures the joint motions of a Positive Action and is represented as

$$F_{jm}(j_1, j_2, t_1, t_2) = \lVert p_{j_1}(t_1) - p_{j_2}(t_2) \rVert. \tag{8}$$
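A sketch of the two joint features for a single Positive Action, again assuming the (frames, 15, 3) pose array used above; the choice of joint pairs and any normalization are left open.

```python
import numpy as np
from itertools import combinations

def joint_distance(pose, t, j1, j2):
    """Euclidean distance between joints j1 and j2 at frame t (formula (7))."""
    return float(np.linalg.norm(pose[t, j1] - pose[t, j2]))

def joint_motion(pose, t1, t2, j1, j2):
    """Euclidean distance between joint j1 at frame t1 and joint j2 at frame t2
    (formula (8)), capturing the motion of the Positive Action."""
    return float(np.linalg.norm(pose[t1, j1] - pose[t2, j2]))

def joint_distance_vector(pose, t):
    """All pairwise joint distances at frame t, one possible per-frame descriptor."""
    return np.array([joint_distance(pose, t, a, b)
                     for a, b in combinations(range(pose.shape[1]), 2)])
```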

(ii) Plane Features

Plane Feature. The plane feature $F_{pl}$ captures the geometric relationship between a plane and a joint; for example, it helps express whether the left hand lies in front of the plane spanned by the right shoulder, the left shoulder, and the torso. It is defined as

$$F_{pl}(j_1, j_2, j_3, j_4, t) = \operatorname{dist}\bigl(p_{j_1}(t),\ \langle p_{j_2}(t), p_{j_3}(t), p_{j_4}(t) \rangle\bigr), \tag{9}$$

where $\langle p_{j_2}(t), p_{j_3}(t), p_{j_4}(t) \rangle$ indicates the plane spanned by the three other joints $j_2$, $j_3$, and $j_4$, and $\operatorname{dist}(\cdot,\cdot)$ represents the Euclidean distance from joint $j_1$ to the plane.

Normal Plane Feature. The normal plane feature $F_{np}$ is similar to the plane feature; it helps to determine whether and how far the joint “hand” is raised above the joint “shoulder.” It is defined as follows:

$$F_{np}(j_1, j_2, j_3, j_4, t) = \operatorname{dist}\bigl(p_{j_1}(t),\ \langle n(t), p_{j_4}(t) \rangle\bigr), \qquad n(t) = p_{j_2}(t) - p_{j_3}(t), \tag{10}$$

where $j_1$ is the target joint as in the plane feature, $\langle n(t), p_{j_4}(t) \rangle$ indicates the plane with normal vector $n(t)$ passing through $p_{j_4}(t)$, and $j_1$, $j_2$, $j_3$, and $j_4$ represent different joints.
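Both plane features reduce to point-to-plane distances. A sketch, with the specific joint quadruples left as parameters and the sign convention chosen by us:

```python
import numpy as np

def plane_feature(pose, t, j, j1, j2, j3):
    """Signed distance from joint j to the plane spanned by joints j1, j2, j3
    at frame t (the plane feature, formula (9))."""
    p1, p2, p3 = pose[t, j1], pose[t, j2], pose[t, j3]
    normal = np.cross(p2 - p1, p3 - p1)
    normal /= np.linalg.norm(normal)
    return float(np.dot(pose[t, j] - p1, normal))

def normal_plane_feature(pose, t, j, j1, j2, j_anchor):
    """Signed distance from joint j to the plane whose normal is the vector
    from joint j2 to joint j1, passing through joint j_anchor
    (the normal plane feature, formula (10))."""
    normal = pose[t, j1] - pose[t, j2]
    normal /= np.linalg.norm(normal)
    return float(np.dot(pose[t, j] - pose[t, j_anchor], normal))
```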

(iii) Velocity Features

Velocity Feature. The velocity feature $F_{vl}$ captures the velocity of one joint along a direction generated by two other joints at time $t$. It is defined as

$$F_{vl}(j_1, j_2, j_3, t) = \bigl(p_{j_1}(t + \delta) - p_{j_1}(t)\bigr) \cdot \frac{p_{j_2}(t) - p_{j_3}(t)}{\lVert p_{j_2}(t) - p_{j_3}(t) \rVert}, \tag{11}$$

where $j_1$, $j_2$, and $j_3$ are different joints and $\delta$ is the frame offset used to approximate the velocity.

Normal Velocity Feature. The normal velocity feature $F_{nv}$ is similar to the normal plane feature; it captures the velocity of one joint along the direction of the normal vector of the plane generated by three other joints at time $t$. It is defined as

$$F_{nv}(j_1, j_2, j_3, j_4, t) = \bigl(p_{j_1}(t + \delta) - p_{j_1}(t)\bigr) \cdot \hat{n}_{\langle j_2, j_3, j_4 \rangle}(t), \tag{12}$$

where $\hat{n}_{\langle j_2, j_3, j_4 \rangle}(t)$ is the unit normal vector of the plane spanned by $p_{j_2}(t)$, $p_{j_3}(t)$, and $p_{j_4}(t)$, and $j_1$, $j_2$, $j_3$, and $j_4$ are different joints.
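The two velocity features project a joint's frame-to-frame displacement onto a direction; a sketch, with the frame offset used to approximate velocity being our choice:

```python
import numpy as np

def velocity_feature(pose, t, j, j1, j2, dt=1):
    """Displacement of joint j over dt frames projected onto the direction
    from joint j2 to joint j1 at frame t (the velocity feature, formula (11))."""
    direction = pose[t, j1] - pose[t, j2]
    direction /= np.linalg.norm(direction)
    displacement = pose[t + dt, j] - pose[t, j]
    return float(np.dot(displacement, direction))

def normal_velocity_feature(pose, t, j, j1, j2, j3, dt=1):
    """Displacement of joint j over dt frames projected onto the unit normal of the
    plane spanned by joints j1, j2, j3 (the normal velocity feature, formula (12))."""
    p1, p2, p3 = pose[t, j1], pose[t, j2], pose[t, j3]
    normal = np.cross(p2 - p1, p3 - p1)
    normal /= np.linalg.norm(normal)
    displacement = pose[t + dt, j] - pose[t, j]
    return float(np.dot(displacement, normal))
```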

4. Positive Action Recognition via HMM

Hidden Markov Models (HMMs) are widely used for modeling time series data. Formally, an HMM can be described as a 5-tuple $\lambda = (S, O, A, B, \pi)$, where $S$ is the set of hidden states, $O$ is the set of observations, and $A$ contains the transition probabilities among states; these probabilities, as well as the starting probabilities $\pi$, are discrete. Every hidden state has a set of possible emissions and discrete or continuous probabilities $B$ for these emissions. A Gaussian Mixture Model (GMM) is used to represent the observations emitted by each hidden state and to compute their probabilities [33]. The GMM density is defined as a weighted sum of Gaussian densities.

In the training process, the HMM parameters are initialized: we manually set the number of observation states $M$ and the number of hidden states $N$; then we divided the data sequence equally into $N$ parts and clustered each part using $k$-means to establish the GMM. Once the initial parameters were set, the Baum-Welch (forward-backward) algorithm was used to reestimate the HMM parameters and to compute the output probability $P(O_i \mid \lambda)$ of the observation sequence $O_i$ (the sample sequence of action $i$). Finally, the sequence probabilities are accumulated, and the HMM parameters are accepted once this likelihood reaches its maximum. After training, we have six HMMs, one for each type of action.
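For readers who want to reproduce the training stage, the snippet below fits one continuous HMM with Gaussian-mixture emissions per action class using the third-party hmmlearn package; the library choice, the 3-state/2-mixture configuration, and the data layout are our assumptions, not necessarily the setup used for the reported experiments.

```python
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

def train_action_hmms(features_by_action, n_states=3, n_mix=2):
    """features_by_action: dict mapping an action label to a list of
    (T_i, d) feature arrays, one per training sample.
    Returns one trained GMM-HMM per action label."""
    models = {}
    for action, sequences in features_by_action.items():
        X = np.concatenate(sequences)              # frames stacked row-wise
        lengths = [len(seq) for seq in sequences]  # per-sample frame counts
        model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
                           covariance_type="diag", n_iter=50)
        model.fit(X, lengths)                      # Baum-Welch reestimation
        models[action] = model
    return models
```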

During the recognition process, given the data sequence of an unknown action $a$, the feature vectors are extracted for each frame. Using the Viterbi algorithm, the likelihood $P(O_a \mid \lambda_i)$ of the observation sequence $O_a$ under model $\lambda_i$ is generated. We repeated this procedure for the six HMMs generated in the training process, producing the probabilities $P(O_a \mid \lambda_1), \ldots, P(O_a \mid \lambda_6)$. Thus, by comparing these values, we obtained the maximum likelihood $\max_i P(O_a \mid \lambda_i)$, which indicates the type of interaction.
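Recognition then scores the unknown sequence under all six trained models and keeps the most likely one. In this sketch, hmmlearn's score method returns the forward log-likelihood, a minor departure from the Viterbi decoding described above that serves the same comparative purpose.

```python
def classify(models, sequence):
    """sequence: (T, d) feature array of an unknown action.
    Returns the action label whose HMM gives the highest log-likelihood."""
    scores = {action: model.score(sequence) for action, model in models.items()}
    return max(scores, key=scores.get)
```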

5. Experimental Results

To recognize interactions, we used the features extracted from the Positive Actions identified in Section 3.2 and, for comparison, the features extracted from the original two-person interaction data as in [5]. Then, we compared and evaluated the recognition results of both approaches. The process for feature extraction and action recognition is illustrated in Figure 7.

In the Positive Action-based interaction approach, the features described in Section 3.3 were grouped into three categories: joint features, plane features, and velocity features. In our experiments, we recognized the six kinds of Positive Actions using each feature group separately as well as the combined features. There are fifteen joints (each with 3D coordinates) for each action; thus, the joint features have a fixed dimension per frame, and the full feature vector of an interaction has this dimension multiplied by $T$, the total number of frames of the interaction. Considering the larger dimensions of the plane and velocity features, we selected key joints to characterize them. For the plane features, the relationship between the four limbs and the main body is critical; therefore, the planes were spanned by seven joints (“head,” “neck,” “left shoulder,” “right shoulder,” “torso,” “left hip,” and “right hip”), and the eight key joints served as target joints. In this way, we obtained a lower dimension for each frame. However, the feature dimension was still larger than the number of training samples; thus, Principal Component Analysis (PCA) was used to reduce the dimensions.
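A sketch of the dimensionality-reduction step with scikit-learn's PCA, fitted on the training frames only; the number of retained components is an arbitrary placeholder.

```python
from sklearn.decomposition import PCA

def reduce_features(train_frames, test_frames, n_components=30):
    """train_frames, test_frames: (N, d) arrays of per-frame feature vectors.
    PCA is fitted on the training frames and applied to both sets."""
    pca = PCA(n_components=n_components)
    train_reduced = pca.fit_transform(train_frames)
    test_reduced = pca.transform(test_frames)
    return train_reduced, test_reduced, pca
```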

To classify interactions, evaluation is done with 4-fold cross-validation: 3 folds are used for training and 1 for testing. Because the 3-state HMM performed much better than the 4- and 5-state HMMs in our experiments, we trained 3-state continuous HMMs with GMM observation densities. As expected, the transition probabilities and the observation probabilities turned out to be different for different actions. After training, the HMM parameters are fixed, and the Viterbi algorithm was used to find the maximum-likelihood category. Table 2 shows the experimental results for each kind of feature representation.
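The evaluation protocol, sketched as a 4-fold split over the samples of one feature representation; train_action_hmms and classify are the helpers from the earlier sketches, and the shuffling is our choice.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(samples, labels, n_folds=4):
    """samples: list of (T, d) feature arrays; labels: parallel list of action names.
    Returns the mean recognition accuracy over the folds."""
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True).split(samples):
        by_action = {}
        for i in train_idx:
            by_action.setdefault(labels[i], []).append(samples[i])
        models = train_action_hmms(by_action)
        correct = sum(classify(models, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return float(np.mean(accuracies))
```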

For the traditional two-person relationship-based interaction recognition method (called the old approach in the rest of this paper), three kinds of features following [5] were also extracted from the original captured data (see Figure 3). The training and recognition process was identical to that of the Positive Action-based (new) method. Figure 8 shows the recognition results as confusion matrices: (a)–(c) represent the Positive Action-based approach, and (d)–(f) represent the old approach. The confusion matrices also compare the different kinds of features used for recognition: the joint features include the joint motion and joint distance features; the plane features include the plane and normal plane features; the velocity features include the velocity and normal velocity features. The average recognition accuracy for each kind of feature from (a) to (c) is 78.67%, 66.83%, and 55.67%; the average accuracy from (d) to (f) is 70.00%, 61.67%, and 48.67%. Therefore, joint feature-based recognition results are better than those based on plane and velocity features, suggesting that geometric relational features based on the distance between joints outperform other feature choices and verifying the conclusions in [5]. Furthermore, in both the old and new approaches, there is some confusion between “pointing” and “punching” and between “exchanging an object” and “shaking hands.” Our results show that these actions are similar, leading to lower recognition accuracy.

Most importantly, the average accuracy of interaction recognition based on Positive Action representation, as proposed in this paper, is 7% higher than that of the two-person relationship-based approach, and for the geometric relational (joint) features the gain is almost 10%. There are several reasons for these results. First, a two-person feature representation is more complex than a Positive Action-based representation, introducing unstable factors. Consider, for example, the “pointing” interaction with the normal plane feature: the Positive Action-based method only judges whether one person’s “hand” position is higher than that person’s own “shoulder,” whereas the old approach in [5] must judge the spatial relationship with respect to both persons’ shoulders, which leads to more conditions for recognition; therefore, the Positive Action-based approach needs fewer training samples than the old approach to reach roughly the same recognition accuracy. Second, for the same kind of feature, the Positive Action-based representation has fewer dimensions than the old approach. The old approach is therefore more sensitive to the dimension reduction in the training process, and its recognition accuracy is lower.

To verify the generalizability of our proposed method, we tested the dataset with two more classifiers, Support Vector Machines (SVMs) and Multiple Instance Learning (MIL). The test features were the combination of joint distance and joint motion. The results in Table 3 suggest that MIL performs better than SVMs, while the Positive Action-based method is much better than the two-person-based method. Therefore, the results across different classifiers support the conclusion that our new method is effective.

In addition to comparing the interaction recognition accuracy of both approaches, we also compared time costs and evaluated the training time needed to arrive at the optimal HMM parameters (see Figure 9). The average training times for the three kinds of features based on Positive Action representation are 42.47 ms (milliseconds), 79.52 ms, and 67.88 ms, while for the old approach following [5] the average training times are 63.27 ms, 199.70 ms, and 156.38 ms. The Positive Action-based representation method consumes less time than the old approach.

In summary, Positive Action-based representation for two-person interaction recognition outperforms the old approach; not only is its recognition accuracy better, but also the time cost for training is less. So, the new method transforms a relatively complex two-person interaction into a simpler Positive Action, making the recognition procedure more cost effective while maintaining or even improving recognition quality. Therefore, the new proposed approach is efficient for interaction recognition.

6. Conclusion

This paper presented a novel approach to recognize relatively complex human interactions. Unlike many existing interaction recognition methods, we focused our research on the single actions that are most useful for distinguishing between types of interactions, transforming two-person interaction recognition into Positive Action-based recognition.

The key contributions of this paper are as follows: (1) we investigated the reciprocal relationships in two-person interactions and proposed a new definition for a single person’s behavior, called the Positive Action; (2) two-person interactions were recognized based on the Positive Action representation via continuous HMMs; (3) a new test interaction dataset based on the Microsoft Kinect camera was created and made publicly available. Our experimental results demonstrate that the proposed method outperforms old approaches based on two-person relationships.

In the future, we plan to find more volunteers to capture more data and extend our interaction dataset to include additional interaction categories. More importantly, owing to the limitations of human tracking software, such as the NITE Middleware or the Windows SDK for Kinect, there occasionally are some inaccurate tracking results. Therefore, we need to find a better way to track human actions, further improving the recognition accuracy.

Acknowledgments

The authors are grateful to the volunteers for capturing data. This work was supported by the National Natural Science Foundation (41301517), the National 863 Key Program (2013AA122301), the National Key Technology R&D Program (2012BAH35B03), Chinese NSF Creative Research Group project (41023001), the National 973 Program (2011CB707001), the Fundamental Research Funds for the Central Universities (2012619020215), and Doctoral Fund of Ministry of Education (20120141120006).