Abstract

Nowadays, advancements in depth imaging technologies have made human activity recognition (HAR) reliable without attaching optical markers or any other motion sensors to human body parts. This study presents a depth imaging-based HAR system to monitor and recognize human activities. In this work, we propose a spatiotemporal features approach to detect, track, and recognize human silhouettes from a sequence of RGB-D images. Under the proposed HAR framework, the required procedure includes detection of human depth silhouettes from the raw depth image sequence, removal of background noise, and tracking of human silhouettes using frame differentiation constraints on human motion information. From these depth silhouettes, spatiotemporal features are extracted based on depth sequential history, motion identification, optical flow, and joints information. These features are then processed by principal component analysis for dimension reduction and better feature representation. Finally, the optimal features are trained, and activities are recognized, using hidden Markov models. In the experiments, we evaluate the proposed approach on three challenging depth video datasets: IM-DailyDepthActivity, MSRAction3D, and MSRDailyActivity3D. All experimental results show the superiority of the proposed approach over the state-of-the-art methods.

1. Introduction

Human tracking and activity recognition are defined as recognizing different activities through activity feature extraction and pattern recognition techniques applied to input data from innovative sensors (i.e., motion sensors and video cameras) [1–5]. In recent years, advances in these sensors have boosted the development of novel techniques for pervasive human tracking, observing human motion, detecting uncertain events [6–8], silhouette tracking, and emotion recognition in real-world environments [9–11]. The term most commonly used to cover all these topics is human tracking and activity recognition [12–14]. In motion sensor-based activity recognition, activities are classified from sensory data acquired by one or more sensor devices. In [15], Casale et al. presented a complete review of state-of-the-art activity classification methods using data from one or more accelerometers. In that work, classification is based on Random Forest (RF) features which classify five daily routine activities from a Bluetooth accelerometer placed on the chest, using a 319-dimensional feature vector. In [16], a fast Fourier transform (FFT) and a decision tree classifier are proposed to detect physical activity using biaxial accelerometers attached to different parts of the human body. However, these motion sensor-based approaches are not feasible for recognition because users find it uncomfortable to wear electronic sensors in their daily life. Also, combining multiple sensors to improve recognition performance causes a high computational load. Thus, video-based human tracking and activity recognition is proposed, where depth features are extracted from an RGB-D video camera.

Depth silhouettes have made proactive contributions and are the most popular representation for human tracking and activity recognition, from which useful human shape features are extracted. Depth silhouettes address open research issues and are used in practical applications including life-care systems, surveillance systems, security systems, face verification, patient monitoring systems, and human gait recognition systems. In [17], several algorithms are developed for feature extraction from the silhouette data of the tracked human subject using depth images as the pixel source. These parameters include the ratio of height to weight of the tracked human subject, while motion characteristics and distance parameters are also used as features for activity recognition. In [14], a novel life-logging, translation- and scaling-invariant feature approach is designed in which 2D maps are computed through the Radon transform and further processed into 1D feature profiles through the R transform. These features are reduced by PCA and symbolized by the Linde-Buzo-Gray (LBG) clustering technique to train and recognize different activities. In [18], a discriminative representation method is proposed based on structure-motion kinematic features, including structure similarity and head-floor distance derived from skeleton joint point information. These trajectory-projection-based kinematic schemes are learned by an SVM classifier to recognize activities from depth maps. In [19], an activity recognition system is designed to provide continuous monitoring and recording of daily life activities. The system takes depth silhouettes as input to produce a skeleton model and its body joint points. This information is used as features computed from a set of magnitude and directional angle features, which are further used for training and testing via hidden Markov models (HMMs). These state-of-the-art methods [14, 17–19] achieved good recognition accuracy using depth silhouettes. However, it remains difficult to find the best features from limited information such as joint points, especially during occlusions, which negatively affects recognition accuracy. Therefore, we develop a methodology that combines full-body silhouettes and joint information to improve activity recognition performance.

In this paper, we propose a novel method to recognize activities using a sequence of depth images. During preprocessing, we extract human depth silhouettes using background/floor removal techniques and track human silhouettes using a rectangular box whose size is adjusted according to body shape measurements (i.e., height and width). During spatiotemporal feature extraction, a set of multifused features is computed: depth sequential history, motion identification, optical flow, joint angle, and joint location features. These features are further processed by principal component analysis (PCA) to capture global information and reduce dimensionality. The features are then clustered with K-means and fed into a four-state left-to-right HMM for training/testing of human activities. The proposed system is compared against state-of-the-art approaches and achieves the best recognition rate over three challenging depth video datasets: IM-DailyDepthActivity, MSRAction3D, and MSRDailyActivity3D.

The rest of this paper is structured as follows. Section 2 describes the architecture of the proposed system, explaining depth map preprocessing, feature extraction techniques, and training/testing of human activities using HMMs. In Section 3, we present experimental results for the proposed and state-of-the-art methods. Finally, Section 4 presents the conclusion.

2. Proposed System Methodology

2.1. System Overview

The proposed activity recognition system consists of capturing a sequence of depth images with an RGB-D video sensor, background removal, and human tracking from the time-sequential activity video images. Then, feature representation based on spatiotemporal features, clustering via K-means, and training/recognition using the recognizer engine are performed. Figure 1 shows the overall steps of the proposed human activity recognition system.

2.2. Depth Images Preprocessing

During vision-based image preprocessing, we capture video data (i.e., digital and RGB-D) from which both binary and depth human silhouettes are retrieved for each activity. For binary silhouettes, color images received from a digital camera are converted into binary images. For depth silhouettes, we obtain depth images from depth cameras (i.e., PrimeSense, Bumblebee, and ZCam) at a resolution of 320 × 240 with depth levels per pixel [20, 34, 35]. These cameras provide both RGB images and raw depth data.

For comparison, binary images provide only minimal information (i.e., black or white pixels), and significant pixel values are lost, especially for hand movements in front of the chest or legs crossing each other. Depth silhouettes, however, provide much richer information in the form of intensity values and additional body part information (i.e., joint points), which helps handle self-occlusion (see Figure 2).

Therefore, to deal with depth images, we remove noisy background effects by simply ignoring the ground line (i.e., the y parameter), which takes the lowest value (i.e., equal to zero) for a given pair of x- and z-axis coordinates, for floor removal. Next, we partition all objects in the frame using the variation of intensity values between consecutive frames. Then, we differentiate the depth values of corresponding neighboring pixels within a specific threshold and extract human depth silhouettes using the depth center values of each object in the scene. Finally, we apply human tracking by considering temporal continuity constraints (see Figure 3) between consecutive frames [21, 27], while human silhouettes are enclosed within a rectangular bounding box having specific values (i.e., height and width) based on face recognition and motion detection [36–38].
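As a rough illustration of this preprocessing stage, the following Python sketch combines floor removal, frame differencing, and bounding-box tracking. The function name, threshold value, and floor-row parameter are hypothetical placeholders, not the paper's actual implementation.

```python
import numpy as np

def segment_and_track(depth_frames, floor_row, diff_thresh=40.0, box=None):
    """Sketch: floor removal, frame differencing, and bounding-box tracking
    under temporal continuity (names and threshold are assumptions)."""
    silhouettes = []
    prev = None
    for frame in depth_frames:                       # frame: (H, W) raw depth map
        d = frame.astype(np.float32)
        d[floor_row:, :] = 0                         # drop pixels at/below the ground line
        moving = (np.abs(d - prev) > diff_thresh) if prev is not None else (d > 0)
        mask = (d > 0) & moving
        ys, xs = np.nonzero(mask)
        if xs.size:                                  # update the rectangular bounding box
            box = (xs.min(), ys.min(), xs.max() - xs.min() + 1, ys.max() - ys.min() + 1)
        sil = np.zeros_like(d)
        if box is not None:
            x0, y0, w, h = box
            sil[y0:y0 + h, x0:x0 + w] = d[y0:y0 + h, x0:x0 + w]
        silhouettes.append(sil)
        prev = d
    return silhouettes
```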

2.3. Spatiotemporal Features Extraction

For spatiotemporal feature extraction, we compose the depth shape features from depth sequential history, standard deviation, motion variation among images, and optical flow, while the joint angle and joint location features are derived from the joint points. Combining these features captures more spatial and temporal depth-based properties, which are useful for activity classification and recognition. All features are explained below.

Depth Sequential History. The depth sequential history feature is used to observe pixel intensity information over the whole sequence of each activity (see Figure 4). It contains temporal values, positions, and movement velocities. The depth sequential history is therefore defined over the initial and final images of an activity and the duration of the activity period.
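Since the defining equation is not reproduced here, the following is only a minimal sketch of one common way to realize such a temporal-history feature: a recency-weighted accumulation of silhouette pixels between the initial and final frames. The exact formulation in the paper may differ.

```python
import numpy as np

def depth_sequential_history(frames):
    """Sketch: recency-weighted accumulation of silhouette pixels over an
    activity of T frames (an assumed realization, not the paper's exact formula)."""
    T = len(frames)                                  # duration of the activity period
    dsh = np.zeros_like(frames[0], dtype=np.float32)
    for t, frame in enumerate(frames, start=1):      # from the initial to the final image
        active = frame > 0
        dsh[active] = t / T                          # recent silhouette pixels get higher weight
        dsh[~active] = np.maximum(dsh[~active] - 1.0 / T, 0.0)
    return dsh
```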

Different Intensity Values Features. The standard deviation is computed as the sum of all the differences of the image pairs with respect to the time series (see Figure 5). It produces a rather dispersed output and reveals hidden values (i.e., especially depth coordinates) having a large range of intensity values.
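A compact sketch of this computation, assuming the feature consists of the per-pixel standard deviation over the time series together with the summed absolute image-pair differences:

```python
import numpy as np

def intensity_variation_features(frames):
    """Sketch: per-pixel standard deviation over the time series and the
    summed absolute differences of consecutive image pairs."""
    stack = np.stack(frames).astype(np.float32)      # shape (T, H, W)
    std_map = stack.std(axis=0)                      # dispersion of depth intensities
    diff_map = np.abs(np.diff(stack, axis=0)).sum(axis=0)
    return std_map, diff_map
```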

Depth Motion Identification. The motion identification feature is used to capture intra-/inter-motion variation and temporal displacement (see Figure 6) among consecutive frames of each activity.
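One plausible realization, assuming signed frame differencing that separates newly appearing depth regions from disappearing ones, is sketched below; the paper's exact formulation is not given here.

```python
import numpy as np

def motion_identification(prev_frame, curr_frame, thresh=30.0):
    """Sketch: signed frame differencing between consecutive depth frames,
    split into appearing and disappearing motion regions (assumed form)."""
    diff = curr_frame.astype(np.float32) - prev_frame.astype(np.float32)
    appearing = np.where(diff > thresh, diff, 0.0)        # motion entering the region
    disappearing = np.where(diff < -thresh, -diff, 0.0)   # motion leaving the region
    return appearing, disappearing
```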

Motion-Based Optical Flow Features. To make use of the additional motion information in the depth sequence, we apply an optical flow technique based on the Lucas-Kanade method. It calculates the motion intensity and directional angular values between two images. Figure 7 shows samples of optical flow computed from two depth silhouette images.
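As an illustration, the sketch below uses OpenCV's pyramidal Lucas-Kanade tracker between two depth silhouettes and summarizes the flow as a magnitude-weighted histogram of directional angles. The histogram binning and corner-detection parameters are assumptions, not taken from the paper.

```python
import cv2
import numpy as np

def lk_flow_features(prev_depth, next_depth, max_corners=200):
    """Sketch: Lucas-Kanade flow between two depth silhouettes, summarized as a
    magnitude-weighted direction histogram plus the mean motion magnitude."""
    prev8 = cv2.normalize(prev_depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    next8 = cv2.normalize(next_depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    pts = cv2.goodFeaturesToTrack(prev8, max_corners, 0.01, 5)
    if pts is None:
        return np.zeros(9)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev8, next8, pts, None)
    good_old = pts[status.ravel() == 1].reshape(-1, 2)
    good_new = nxt[status.ravel() == 1].reshape(-1, 2)
    d = good_new - good_old                              # per-point displacement vectors
    mag = np.hypot(d[:, 0], d[:, 1])                     # motion intensity
    ang = np.arctan2(d[:, 1], d[:, 0])                   # directional angular values
    hist, _ = np.histogram(ang, bins=8, range=(-np.pi, np.pi), weights=mag)
    return np.concatenate([hist, [mag.mean() if mag.size else 0.0]])
```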

Joints Angle Features. Due to the similar or complex postures of different activities, silhouette features alone are not sufficient; therefore, we use a skeleton model providing 15 joint points (see Figure 8).

The joint angle features measure the directional movements of the i-th joint point between consecutive frames [39, 40], considering all three coordinate axes of the body joints with respect to consecutive frames [41–43].
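Since the exact equation is not reproduced here, the following is only a hedged sketch of one way to compute a per-joint directional angle from the displacement of each joint between consecutive frames; the reference axis and the angle definition are assumptions.

```python
import numpy as np

def joint_angle_features(prev_joints, curr_joints):
    """Sketch: directional movement angle of each of the 15 joints between
    consecutive frames (assumed formulation, yielding a 1 x 15 feature vector)."""
    disp = curr_joints - prev_joints                 # (15, 3) displacement vectors
    norms = np.linalg.norm(disp, axis=1) + 1e-8
    ref = np.array([0.0, 1.0, 0.0])                  # assumed vertical reference axis
    cosines = disp @ ref / norms
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
```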

Joints Location Features. The joint location features measure the distance between the torso joint point and each of the other fourteen joint points in every frame of the activity sequence.
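A minimal sketch of this distance computation, assuming Euclidean distance and a particular torso index in the skeleton layout (both assumptions):

```python
import numpy as np

def joint_location_features(joints, torso_idx=2):
    """Sketch: Euclidean distance from the torso joint to the other 14 joints
    in one frame (torso index is an assumed skeleton convention)."""
    torso = joints[torso_idx]                        # joints: (15, 3) array of 3D points
    others = np.delete(joints, torso_idx, axis=0)    # remaining 14 joints
    return np.linalg.norm(others - torso, axis=1)    # 1 x 14 feature vector
```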

Finally, we obtain joint angle and joint location feature vectors of size 1 × 15 and 1 × 14, respectively. Figures 9(a) and 9(b) show 1D plots of both the joint angle and joint location features for the exercise, kicking, and cleaning activities.

2.4. Feature Reduction

Since spatiotemporal feature extraction using depth shape features produces a high-dimensional feature vector, PCA is applied to extract global information [44, 45] from all activity data and to project the high-dimensional features [46] into a lower-dimensional space. In this work, 750 principal components of the spatiotemporal features are chosen from the whole PC feature space, so the feature vector becomes 1 × 750.
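For illustration, the reduction step can be sketched with scikit-learn's PCA; the feature matrix below is a random placeholder standing in for the concatenated features of Section 2.3.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder feature matrix: one row per sample, columns are the concatenated
# spatiotemporal features (random data here, for illustration only).
X = np.random.rand(1200, 4096)

pca = PCA(n_components=750)           # keep 750 principal components
X_reduced = pca.fit_transform(X)      # each sample becomes a 1 x 750 vector
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```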

2.5. Symbolization, Training, and Recognition

Each feature vector of an individual activity is symbolized using the K-means clustering algorithm. An HMM consists of finite states, where each state involves transition probabilities and symbol observation probabilities [47, 48]. In an HMM, the underlying hidden process is observed through another set of stochastic processes that produce the observation symbols. For training, an HMM is trained for each activity with a codebook of size 512. During HAR, the trained HMMs of the activities are used to choose the maximum likelihood of the desired activity [49–52]. The sequence of trained data is generated and maintained by a buffer strategy [31, 53]. Figure 10 shows the transition and emission probabilities of the cleaning HMM after training.
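The following sketch shows the symbolization and scoring steps under simplifying assumptions: K-means builds the 512-symbol codebook, and a scaled forward pass scores a symbol sequence against a discrete four-state left-to-right HMM. Baum-Welch training is not shown, and all data are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

# Symbolization: quantize reduced 1 x 750 feature vectors into 512 codebook symbols.
rng = np.random.default_rng(0)
codebook = KMeans(n_clusters=512, n_init=10, random_state=0).fit(rng.random((2048, 750)))
obs = codebook.predict(rng.random((60, 750)))        # one observation (symbol) sequence

def forward_log_likelihood(obs, A, B, pi):
    """Scaled forward algorithm for a discrete HMM.
    A: (4, 4) left-to-right transitions, B: (4, 512) emissions, pi: (4,) priors."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum() + 1e-300
    log_lik, alpha = np.log(c), alpha / c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum() + 1e-300
        log_lik += np.log(c)
        alpha /= c
    return log_lik

# Recognition: the activity whose trained HMM yields the highest likelihood wins, e.g.
# scores = {name: forward_log_likelihood(obs, A[name], B[name], pi[name]) for name in hmms}
```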

3. Experimental Results and Descriptions

3.1. Experimental Settings

The proposed method is evaluated on three challenging depth video datasets. The first is our own annotated depth dataset known as IM-DailyDepthActivity [54]. It includes fifteen types of activities: sitting down, both hands waving, bending, standing up, eating, phone conversation, boxing, clapping, right hand waving, exercise, cleaning, kicking, throwing, taking an object, and reading an article. For experimental evaluation, we used 375 video sequences for training and 30 unsegmented videos for testing. All videos were collected in indoor environments (i.e., labs, classrooms, and halls) and performed by 15 different subjects. Figure 11 shows some depth activity images from the IM-DailyDepthActivity dataset.

The second is the public MSRAction3D dataset, and the third is the MSRDailyActivity3D dataset. In the following sections, we explain and compare our method with other state-of-the-art methods using all three depth datasets.

3.2. Comparison of Recognition Rate of Proposed and State-of-the-Art Methods Using IM-DailyDepthActivity

We compare our spatiotemporal features method with state-of-the-art methods including body joints, eigenjoints, depth motion maps, and super normal vector features using depth images. Table 1 shows that the spatiotemporal features achieved the highest recognition rate of 63.7%, outperforming the state-of-the-art methods.

3.3. Recognition Results of Public Dataset (MSRAction3D)

The MSRAction3D dataset is a public dataset captured by a Kinect camera in a game-console interaction setting. It includes twenty actions: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, drawing X, drawing tick, drawing circle, hand clap, two-hand wave, side boxing, bending, forward kicking, side kicking, jogging, tennis swing, tennis serve, golf swing, and pickup and throw. The overall dataset consists of 567 (i.e., 20 actions × 10 subjects × 2 or 3 trials) depth map sequences. The dataset is quite challenging because different actions share similar postures. Examples of actions from this dataset are shown in Figure 12.

To perform the experiments on MSRAction3D, we evaluated all 20 actions and examined their recognition accuracy using a leave-one-subject-out (LOSO) cross-subject training/testing protocol. Table 2 shows the recognition accuracy on this dataset.
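A minimal sketch of this evaluation protocol, using scikit-learn's LeaveOneGroupOut splitter; the arrays below are placeholders for the real features, labels, and subject identifiers.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder arrays: real features/labels come from the pipeline in Section 2.
X = np.random.rand(200, 750)                    # reduced feature vectors
y = np.random.randint(0, 20, size=200)          # 20 MSRAction3D action labels
subjects = np.random.randint(0, 10, size=200)   # subject id of each sequence

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    # Train one HMM per action on X[train_idx]; score the held-out subject's
    # sequences in X[test_idx] and pick the maximum-likelihood action.
    pass
```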

Some other researchers used the MSRAction3D dataset [22–26, 28] by dividing it into action set 1, action set 2, and action set 3 as described in [22]; we compare the recognition performance of the spatiotemporal method with these state-of-the-art methods in Table 3. All methods were implemented by us following the instructions provided in their respective papers.

3.4. Recognition Results of Public Dataset (MSRDailyActivity3D)

The MSRDailyActivity3D dataset is a depth activity dataset collected by a Kinect device, covering daily living room routines. It includes sixteen activities: stand up, sit down, walk, drink, write on a paper, eat, read book, call on cell phone, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lie down on sofa, and play guitar. The dataset includes 320 (i.e., 16 activities × 10 subjects × 2 trials) depth activity videos, mostly performed in a room. These activities also involve human-object interactions. Examples from the MSRDailyActivity3D dataset are shown in Figure 13.

Table 4 shows the accuracy obtained by the proposed spatiotemporal features method for the 16 human activities of this dataset.

Finally, we report the comparison of recognition accuracy on the MSRDailyActivity3D dataset in Table 5, where the proposed method shows a superior recognition rate over the state-of-the-art methods.

4. Conclusions

In this paper, we proposed spatiotemporal features based on depth images captured by a Kinect camera for human activity recognition. The features include depth sequential history, which represents the spatiotemporal information of human silhouettes in each activity; motion identification, which measures the change in motion between consecutive frames; and optical flow, which represents partial image motion to obtain optimal depth information. In the experiments, these features were applied to the proposed IM-DailyDepthActivity dataset and to the MSRAction3D and MSRDailyActivity3D datasets. The proposed activity recognition system achieves a superior recognition accuracy of 63.7% over the state-of-the-art methods on our annotated depth dataset. On the public datasets, our method achieved accuracies of 92.4% and 93.2%, respectively. In future work, we will explore more enhanced feature techniques for complex activities and multiple-person interactions.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

The research was supported by the Implementation of Technologies for Identification, Behavior, and Location of Human Based on Sensor Network Fusion Program through the Ministry of Trade, Industry and Energy (Grant no. 10041629). This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (B0101-16-0552, Development of Predictive Visual Intelligence Technology).