The representation and selection of action features directly affect the performance of human action recognition methods. A single feature is often affected by human appearance, the environment, camera settings, and other factors. To address the problem that existing multimodal feature fusion methods cannot effectively measure the contribution of different features, this paper proposes a human action recognition method based on RGB-D image features, which makes full use of the multimodal information provided by RGB-D sensors to extract effective human action features. Three kinds of human action features with different modal information are proposed: the RGB-HOG feature, based on RGB image information, which has good geometric scale invariance; the D-STIP feature, based on depth images, which maintains the dynamic characteristics of human motion and has local invariance; and the S-JRPF feature, based on skeleton information, which describes the spatial structure of motion well. Multiple K-nearest neighbor classifiers with better generalization ability are then combined to make an ensemble classification decision. The experimental results show that the algorithm achieves good recognition results on the public G3D and CAD60 datasets.

1. Introduction

Human action recognition is an interdisciplinary research direction in the field of computer vision, involving image processing, pattern recognition, machine learning, and artificial intelligence. With the rapid development of digital image processing technology and intelligent hardware manufacturing technology, human action recognition has wide application prospects in intelligent video monitoring [1–4], natural human-computer interaction [5, 6], smart home products [7–9], and virtual reality [10]. The popularity of human action recognition has led to several survey articles [11–15], which discuss the various features and classifiers that have been used for human action recognition. In recent decades, computer vision research based on RGB image information has become increasingly abundant. However, RGB images usually provide only the appearance information of objects in the scene. When the foreground and background of an RGB image are similar in texture or color, it is difficult to perform accurate recognition from the limited RGB information alone. In addition, the appearance of an object in an RGB image may not be robust to common visual changes, such as illumination changes, which seriously hinders the use of RGB-based visual algorithms in real-world environments.

With the continuous progress of science, Microsoft has released the Kinect sensor, which provides RGB information, scene depth information, and also human skeletal information in the scene. The depth image information is only related to the distance between the object and the camera and is not affected by illumination variation, environmental changes, and shadows. The human action sequence, in the form of multimodal sensor data, contains rich temporal patterns that can be used to distinguish between different action categories.

This paper makes full use of the multimodal information provided by the Kinect sensor to extract effective human action features and uses a multilearner integration strategy based on the K-nearest neighbor algorithm to construct a classification model.

The main contributions of this article are as follows:

(1) The RGB modal information, based on the histogram of oriented gradients (RGB-HOG), maintains good invariance to both geometric and optical deformation. The depth modal information, based on space-time interest points (D-STIP), keeps the dynamic stability of a human action feature and maintains the good local invariance characteristics of human movement. The skeleton modal information, based on the joints’ relative position feature (S-JRPF), describes the spatial structure of human action well. Together, the three modal features effectively represent human behavior and provide a reliable behavior representation.

(2) This work uses a multilearner ensemble to classify the prediction samples and makes full use of the learning biases of different learners to enhance the generalization ability of the overall model.

The rest of this paper is organized as follows. Section 2 presents related work. Section 3 describes the method framework for human action recognition. Section 4 introduces three different behavioral descriptors. Section 5 introduces the human action recognition algorithm. Experimental results are given in Section 6 to verify the feasibility and performance of the proposed method. Finally, a brief conclusion and future work are given in Section 7.

2. Related Works

Although there have been many achievements in the research of action recognition, human action recognition in a real environment remains difficult. Video-based human action recognition can be divided into RGB data-based and RGB-D data-based human action recognition. Compared with RGB-D data, RGB data have more abundant appearance information and can better describe the interaction between humans and objects. However, RGB data are easily affected by background and imaging conditions such as weather, lighting, shooting angle, and clothing, which makes it difficult to extract features against the background. Compared with traditional RGB data, RGB-D data are not affected by changes in illumination, color, or texture. More importantly, the contour and skeleton of the human body can be estimated reliably from them.

With the development of RGB-D cameras, especially the Kinect sensor launched by Microsoft, recent research has focused on the use of depth images to solve this problem. Compared with traditional RGB data, the depth information provided by RGB-D images is more robust to changes in lighting conditions. The ever-growing popularity of Kinect and inertial sensors has prompted intensive research efforts on human action recognition. Since human actions are captured by both Kinect and inertial sensors, they can be characterized by multiple feature representations. By encoding the multiview features into a unified space, richer data become available for human action recognition.

In recent years, human action recognition based on video has made great progress. Many scholars have summarized and analyzed human action recognition methods based on RGB-D data [16, 17]. According to the different data, the method of human action recognition based on depth sensor can be divided into three parts: depth image sequence-based method, skeleton data-based method, and multimodal feature fusion-based method.

2.1. Depth Image Sequence-Based Method

In RGB-D video, depth data can be regarded as a spatiotemporal structure composed of depth information. The feature representation of an action is the process of extracting features from this spatiotemporal structure. Methods based on depth sequences mainly use the changes of the human body in the depth map to describe the action. Sahoo et al. [18] applied depth history images (DHIs) to AlexNet to fine-tune the weights of the pretrained deep learning architecture. Because the DHI alone is not sufficient to recognize closely related actions, 3D projected planes are also extracted and trained separately on AlexNet. Two types of projected planes are extracted from the action videos in that work: the XT plane (side view) and the YT plane (top view). The scores from both learning techniques are fused to provide the final recognition score. Li et al. [19] proposed a real-time human action recognition system that uses a depth map sequence as input. The system comprises human segmentation, action modeling based on 3D shape context, and an action graph algorithm. Xu et al. [20] proposed an effective method for human action recognition from depth images. A multilevel frame select sampling (MFSS) method first generates three levels of temporal samples from the input depth sequences. Then, a motion and static mapping (MSM) method is used to obtain the representation of the MFSS sequences. After that, a block-based LBP feature extraction approach extracts feature information from the MSM. Finally, the Fisher kernel representation is applied to aggregate the block features, which are then classified with a kernel-based extreme learning machine. Chen et al. [21] proposed a human action recognition method using depth motion maps (DMMs). Each depth frame in a depth video sequence is projected onto three orthogonal Cartesian planes. Under each projection view, the absolute difference between two consecutive projected maps is accumulated through the entire depth video sequence to form a DMM. An $l_2$-regularized collaborative representation classifier with a distance-weighted Tikhonov matrix is then employed for action recognition. The method is shown to be computationally efficient, allowing it to run in real time. The above methods identify actions by analyzing and modeling the motion information in the depth sequence. However, because depth video is noisy and lacks appearance and texture information, depth sequence-based methods have not achieved ideal results on many datasets.
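As an illustration of the depth motion map idea described above for Chen et al. [21], the following is a minimal sketch, not that paper's implementation. It assumes depth frames are small 2D lists of integer depth values below `depth_bins`, that a value of 0 means "no measurement", and that the side and top projections are binary occupancy maps; the function names `project` and `depth_motion_maps` are hypothetical.

```python
def project(frame, depth_bins):
    """Project one depth frame onto the front (x-y), side (y-z), and top (z-x) planes.

    Assumption: side/top maps are binary occupancy maps; depth 0 is ignored.
    """
    h, w = len(frame), len(frame[0])
    front = [row[:] for row in frame]
    side = [[0] * depth_bins for _ in range(h)]
    top = [[0] * w for _ in range(depth_bins)]
    for y in range(h):
        for x in range(w):
            z = frame[y][x]
            if z > 0:
                side[y][z] = 1
                top[z][x] = 1
    return front, side, top


def depth_motion_maps(frames, depth_bins):
    """Accumulate |map_t - map_(t-1)| over the sequence for each of the three views."""
    views = [project(f, depth_bins) for f in frames]
    dmms = []
    for v in range(3):
        rows, cols = len(views[0][v]), len(views[0][v][0])
        acc = [[0] * cols for _ in range(rows)]
        for prev, cur in zip(views, views[1:]):
            for r in range(rows):
                for c in range(cols):
                    acc[r][c] += abs(cur[v][r][c] - prev[v][r][c])
        dmms.append(acc)
    return dmms  # [front DMM, side DMM, top DMM]
```

Pixels that never change contribute nothing to a DMM, so each map highlights where motion energy accumulated under its view.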

2.2. Skeleton Data-Based Method

Action recognition based on skeleton data is an important direction in depth data research. Based on the skeleton sequence of the human body, this approach uses the changes of human joints between video frames, including changes in joint position and appearance, to describe the movement. The skeleton model of the human body can be estimated quickly and accurately from depth data, so human posture estimation based on RGB-D data is widely used. Wan et al. [22] extracted orientation vectors from several groups of skeleton joints and used a stacked residual bidirectional long short-term memory (LSTM) network to build the model. Liu et al. [23] proposed a new LSTM network for action recognition based on skeleton data, the global context-aware attention LSTM network. By using a global context memory unit, the network can selectively focus on the informative joints in each frame. To further improve the attention ability of the network, a recurrent attention mechanism is introduced, through which the attention performance of the network is improved gradually. Liu et al. [24] proposed a human motion recognition method based on skeleton data collected by a depth sensor. To make full use of the skeleton data, movement features such as position, speed, and acceleration are extracted from each frame to capture the dynamic and static information of the action. Finally, a k-nearest neighbor algorithm with weighted voting is used for action recognition, with pose specificity as the voting weight. Phyo et al. [25] used skeleton motion history images to build a deep learning model for recognizing human behavior. The experimental results show that this method can achieve high recognition accuracy with low computational cost in a variety of environments. Because skeleton information is not affected by background, lighting, and similar factors, it is robust and can be estimated quickly and accurately from depth data. In recent years, with the development of deep learning, the application of convolutional neural networks (CNNs), recurrent neural networks (RNNs), LSTM networks, and other frameworks has advanced skeleton-based motion recognition, and further progress can be expected.

2.3. Multimodal Feature Fusion-Based Method

Each feature extraction method has its own advantages, and the methods are independent of one another. If different features can be fused effectively, a more discriminative feature vector can be obtained and recognition performance will improve. Therefore, fusion methods have attracted the attention of scholars in recent years. There are two kinds of fusion methods: feature level fusion and decision level fusion.

Feature level fusion is an early fusion method. First, feature vectors are extracted by different methods, and then the extracted features are standardized, selected, or transformed to generate a new, more discriminative feature vector. Zhang et al. [26] proposed an action recognition method that combines gradient information and sparse coding. First, a coarse depth-skeleton feature is extracted by using depth gradient information and skeleton joint distances. Then, sparse coding and max pooling are combined to refine the coarse depth-skeleton feature. Finally, random decision forests are used to identify the actions. El Din El Madany et al. [27] proposed a human action recognition framework using globality-locality preserving canonical correlation analysis (GLPCCA). Their work fuses the depth and RGB modalities, using a hierarchical pyramid depth motion map deep convolutional neural network (HP-DMM-CNN) for the depth images and an optical flow convolutional neural network to model the RGB videos. Guo et al. [28] proposed a new unsupervised feature fusion method for human action recognition, termed multiview Cauchy estimator feature embedding (MCEFE). By minimizing empirical risk, MCEFE integrates the complementary information encoded in multiple views to find a unified data representation and the projection matrices. To enhance robustness to outliers, the Cauchy estimator is imposed on the reconstruction error. Asteriadis et al. [29] presented a multimodal human action recognition method designed to handle the noise of a sensing device and person-specific characteristics. Each action is represented by a basis vector, and spectral analysis is performed on an affinity matrix of new action feature vectors. By using modality-dependent kernel regressors to compute the affinity matrix, the complexity is reduced by forming robust low-dimensional representations. Gao et al. [30] proposed pyramid appearance and global structure action descriptors on both RGB and depth motion history images (MHIs) as a way to construct a model-free method for human action recognition. In this algorithm, a motion history image is first constructed for both the RGB and depth channels, while depth information is simultaneously employed to filter the RGB information. Next, different action descriptors are extracted from the depth and RGB MHIs to represent the actions. Then, a multimodality information collaborative representation and recognition model is built, in which the multimodality data enter a single objective function. In this method, information fusion and action recognition are done together, with the goal of classifying human actions.

Decision level fusion is different from feature level fusion. First, the classifier trained for each method outputs its classification result, and then these results are fused to obtain the final classification result. To effectively combine the joint, RGB, and depth information of the Kinect sensor, Seddik et al. [31] proposed local and global support vector machine models with a multilayer fusion scheme to combine different features. Malawski and Kwolek [32] proposed a new motion descriptor called the joint motion history context, which is based on depth and skeleton data. A decision level fusion method based on support vector machines and multilayer perceptrons is used to effectively fuse the motion pattern information of multiple feature sets. Imran and Raman [33] proposed a multimodal action recognition method based on the deep learning paradigm. First, for RGB video, a new image-based descriptor called the stacked dense flow difference image (SDFDI) is proposed, which captures the temporal and spatial information in a video sequence. They train several kinds of deep two-dimensional CNNs and compare SDFDI with state-of-the-art image-based representations. Second, for the skeleton stream, a data augmentation technique based on 3D transformations is proposed to train deep neural networks on small datasets, together with an RNN model based on the bidirectional gated recurrent unit (BiGRU). Third, for the inertial sensor data, a data augmentation method based on Gaussian white noise jitter is proposed, and actions are classified with a deep one-dimensional CNN. The outputs of these three heterogeneous networks are combined by several model fusion methods based on score fusion and feature fusion.
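The difference between the two fusion families discussed in this subsection can be illustrated with a minimal sketch. It does not reproduce any of the cited methods: unit-norm scaling before concatenation and summing per-class scores are assumed choices, and the function names are hypothetical.

```python
import math


def feature_level_fusion(vectors):
    """Early fusion: scale each modality vector to unit L2 norm, then concatenate."""
    fused = []
    for vec in vectors:
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid division by zero
        fused.extend(v / norm for v in vec)
    return fused


def decision_level_fusion(score_dicts):
    """Late fusion: sum the per-class scores of independent classifiers, pick the best class."""
    totals = {}
    for scores in score_dicts:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    return max(totals, key=totals.get)
```

In the first function a single classifier would be trained on the fused vector; in the second, each modality keeps its own classifier and only the outputs are combined.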

Although the existing action recognition method using depth information has made great progress, the reliability of recognition is still unsatisfactory for practical engineering. The primary reason is that human action recognition has great within-class differences but nonobvious between-class differences, and distinguishing the differences of human movement speed requires higher computational complexity.

3. Method Framework

To improve the robustness and practicability of the recognition system and to make full use of the advantages of different features, we use the different modality data provided by the Kinect sensor. Three kinds of features are used as human action descriptors, and a multilearner ensemble algorithm is then used to recognize the action. The system flow is shown in Figure 1. This method preserves the efficiency of computing on simple features while also guaranteeing the robustness of the recognition system and the discrimination ability of the action features. The system framework includes the following steps:

(1) Obtain synchronous RGB images, depth images, and skeleton data from the Kinect sensor.

(2) Transform the RGB image data to grayscale to reduce the scale of data processing, use classical filtering methods to reduce image noise, and then extract a histogram of oriented gradients from the processed image. Extract space-time interest points as features from the depth image data, and extract data describing the relative position of joints as features from the 3D skeleton data.

(3) Classify the action described by the above three features using three different k-nearest neighbor classifiers based on three different distance measures. Select the most similar samples from each classifier, and take the action class that contains the largest number of these samples as the final prediction result.
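Step (3) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the text does not name the three distance measures, so Euclidean, Manhattan, and Chebyshev distances are assumed here, one per modality, and the names `nearest_labels` and `ensemble_predict` are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def nearest_labels(train, query, distance, k):
    """Labels of the k training samples closest to the query under one distance."""
    ranked = sorted(train, key=lambda item: distance(item[0], query))
    return [label for _, label in ranked[:k]]


def ensemble_predict(train_by_modality, query_by_modality, distances, k=3):
    """Pool the k nearest samples of every classifier and return the majority class."""
    votes = Counter()
    for name, distance in distances.items():
        votes.update(
            nearest_labels(train_by_modality[name], query_by_modality[name], distance, k)
        )
    return votes.most_common(1)[0][0]
```

Here `train_by_modality` maps a modality name such as `"rgb_hog"` to a list of `(feature_vector, label)` pairs; pooling the neighbors before voting mirrors the "largest number of samples" rule in step (3).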

4. Feature Extraction

The output of Microsoft Kinect camera is a multimodal signal, which can provide RGB video, depth mapping image sequence, and skeleton joint information at the same time. Thus, it can effectively overcome the loss of depth information and spatial position relationship between objects due to the traditional RGB camera projecting the 3D physical world onto the 2D image plane. The characteristics of different modes are independent but complementary. In order to obtain better recognition performance, this paper effectively fuses the features under multimodality and designs a description vector with high discriminability, that is, using visual information, the depth, and skeleton to improve the recognition results. In this section, three different behavioral descriptors are introduced.

4.1. RGB-HOG

Histogram of oriented gradients (HOG) is a feature descriptor for object detection in computer vision and image processing [34]. HOG descriptors can effectively extract the local gradient and direction information of the image to describe the key characteristics of human behavior. The traditional HOG feature extraction process is a pyramid structure that consists of three layers: cell, block, and image. From bottom to top, the steps are as follows: (1) construct the feature vector of each cell; (2) construct the feature vector of each block; and (3) construct the feature vector of the image. When the cell histogram of the traditional HOG operator is constructed, the influence of neighborhood pixel gradients is not considered, so an “aliasing effect” easily appears. To solve this problem, Dalal et al. [34] used the block overlap method, but its computational cost is large; Pang et al. [35] used linear interpolation to adjust the voting weights of the pixels in the block, but this does not consider the influence of the pixels in the block neighborhood. In fact, from the perspective of the cell, only part of the gradient information of its neighborhood is used, which leads to insufficient information utilization. In this paper, taking the cell as the basic unit, the neighborhood range of the cell is defined and the voting method of neighborhood pixels is further improved. The histogram of the original cell is modified by the gradient amplitude of all pixels in the neighborhood of the cell. The HOG feature extraction algorithm flow is shown in Figure 2.

Step 1: input image and region of interest extraction. In research on human behavior recognition, the region of interest (ROI) is a smaller region selected from an image. This region is the most important part for human motion analysis. The region can be cropped from the full-size image to reduce processing time and increase accuracy.
In this paper, an input image is first analyzed using a region of interest detection algorithm to predict the approximate position of the target and to select the minimum bounding rectangle around the target as the region of interest. Next, a series of operations, including feature extraction, is carried out in the ROI of the original image.

Step 2: image graying and gamma correction. Because image acquisition devices and environments vary, the captured image may be unclear and prone to failed or false detection. Consequently, it is necessary to preprocess the collected human image, mainly to deal with situations where the image is either too dark or too bright. Two processes are used to deal with this issue: image graying and gamma correction.

(a) Image graying: for a color image, the R, G, and B components are converted into a single grayscale value by a weighted sum:

$$\mathrm{Gray}(x, y) = w_R R(x, y) + w_G G(x, y) + w_B B(x, y),$$

where $w_R$, $w_G$, and $w_B$ are the fixed weights of the three color channels.

(b) Gamma correction: in the case of uneven illumination, gamma correction can be used to raise or lower the overall brightness of the image. In practice, two different methods can be used to standardize gamma, employing either the square root or the logarithm. In this paper, we use the square root method. The formula is as follows (where $\gamma = 0.5$):

$$I(x, y) = \mathrm{Gray}(x, y)^{\gamma}.$$

Step 3: gradient calculation. For the normalized image, the gradient and gradient direction are obtained via the following equations:

$$G_x(x, y) = I(x+1, y) - I(x-1, y), \qquad G_y(x, y) = I(x, y+1) - I(x, y-1),$$

$$G(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2}, \qquad \alpha(x, y) = \arctan\frac{G_y(x, y)}{G_x(x, y)},$$

where $G_x$ and $G_y$ are the horizontal and vertical gradients, $G$ is the gradient amplitude, and $\alpha$ is the gradient direction.

Step 4: histogram of oriented gradients. The gradient direction image is divided into N cells of 8 × 8 = 64 pixels each. Adjacent cells do not overlap. The gradient direction of each pixel is counted in each cell. All gradient directions are divided into 9 bins (i.e., a 9-dimensional feature vector), which form the horizontal axis of the histogram, and the accumulated gradient amplitude falling into each angle range forms the vertical axis.

Then, the original histogram vector value is modified.
Suppose $C$ is any cell and $p$ is any pixel in its neighborhood. Let $(x_0, y_0)$ be the coordinates of the center point of the neighborhood area of $C$, and let $(x, y)$ be the coordinates of the pixel $p$, where $\alpha_p$ is the gradient direction value of $p$ and $G_p$ is its gradient amplitude. It is assumed that $\alpha_p$ lies between the direction bins $k$ and $k+1$ of $C$. Let the correction coefficients of $p$ to the histogram of $C$ in direction bins $k$ and $k+1$ be $w_k$ and $w_{k+1}$, respectively, and let the original histogram values of these two bins be $h_k$ and $h_{k+1}$. The trilinear interpolation method is used for correction: the coefficients $w_k$ and $w_{k+1}$ are determined by the position of $p$ relative to $(x_0, y_0)$ within the area and by the offset of $\alpha_p$ within the bin relative to $\Delta\alpha$, where $\Delta\alpha$ is the angle difference between adjacent direction bins.

After correction, the histogram values $h_k'$ and $h_{k+1}'$ of $C$ are the original values $h_k$ and $h_{k+1}$ updated with the gradient amplitude $G_p$ weighted by $w_k$ and $w_{k+1}$, respectively. According to formula (7), the histogram of $C$ is modified by using the gradient information of all pixels in its neighborhood. In the same way, we modify the HOG of the other cells of the original image to get the modified HOG vector.

Step 5: histogram normalization of overlapping blocks. If the illumination and background vary widely across the image, the range of gradient values will be large, so good feature standardization is very important for improving the detection rate. There are many ways to standardize, most of which group cells into blocks and then standardize each block separately. Each cell is 8 × 8 pixels, and each group of 2 × 2 adjacent cells forms a block of 16 × 16 pixels. Adjacent blocks overlap, so the information of adjacent pixels is used effectively, which is very helpful to the detection results.

Next, each block is standardized. There are four cells in a block, and each cell contributes a 9-dimensional feature vector, so each block is represented by a 4 × 9 = 36-dimensional feature vector. In this paper, the L2 norm is used for feature standardization.
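Let $v$ be the unnormalized feature vector of a block and $\varepsilon$ be a very small normalization constant. The normalized vector is

$$v \leftarrow \frac{v}{\sqrt{\lVert v \rVert_2^2 + \varepsilon^2}}.$$

After the histograms of the overlapping blocks are normalized, the feature vectors of all blocks are combined to form the HOG feature.

Step 6: output HOG features.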
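The core of Steps 2 to 5 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the grayscale weights are the ITU-R BT.601 luma coefficients (the text does not state its own), image borders are handled by clamping, and the cell histogram uses the conventional linear vote split between the two nearest orientation bins rather than the neighborhood-corrected voting proposed in this section.

```python
import math

# Assumption: ITU-R BT.601 luma weights; the source does not give its coefficients.
GRAY_WEIGHTS = (0.299, 0.587, 0.114)


def to_gray(rgb):
    """Weighted sum of R, G, B (channel values assumed to lie in [0, 1])."""
    wr, wg, wb = GRAY_WEIGHTS
    return [[wr * r + wg * g + wb * b for (r, g, b) in row] for row in rgb]


def gamma_correct(gray, gamma=0.5):
    """Square-root gamma correction: I = Gray ** 0.5."""
    return [[v ** gamma for v in row] for row in gray]


def gradients(img):
    """Central-difference gradient amplitude and unsigned direction in degrees."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]  # clamped border
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang


def cell_histogram(mag, ang, top, left, size=8, bins=9):
    """9-bin histogram of one cell; each vote is split between the two nearest bins."""
    width = 180.0 / bins
    hist = [0.0] * bins
    for y in range(top, top + size):
        for x in range(left, left + size):
            pos = ang[y][x] / width - 0.5  # position measured from bin centres
            lo = math.floor(pos)
            frac = pos - lo
            hist[lo % bins] += mag[y][x] * (1.0 - frac)
            hist[(lo + 1) % bins] += mag[y][x] * frac
    return hist


def l2_normalize(vec, eps=1e-6):
    """L2 block normalization: v / sqrt(||v||^2 + eps^2)."""
    norm = math.sqrt(sum(v * v for v in vec) + eps * eps)
    return [v / norm for v in vec]
```

A block descriptor would be obtained by concatenating the histograms of its 2 × 2 cells (36 values) and passing the result to `l2_normalize`; the final HOG vector concatenates all overlapping blocks.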

4.2. D-STIP

The action recognition method based on space-time interest points is one of the more popular action recognition methods at present. It describes the action by detecting the interest points whose pixel values have significant changes in the spatiotemporal neighborhood and extracts the underlying features from them.

Because the space-time interest points are extracted from local features, which are not easily affected by illumination, motion characteristics, or background changes, this method has improved robustness over less localized methods.

In this paper, we implement the representation of space-time interest points and space-time words based on depth images. This method first extracts accurate space-time interest points from the samples and then extracts the local neighborhood features of the interest points. Next, a space-time codebook based on the features of the interest points is established, and a statistical histogram of the interest points based on the space-time codebook is obtained. The D-STIP extraction flowchart is shown in Figure 3.

Step 1: Dollar STIP detection. Laptev extended the 2D Harris corner [36] to the 3D Harris corner [37] and used such corners as the points of significant change in the spatiotemporal domain. First, the video sequence is represented in the linear scale space as

$$L(\cdot;\sigma^2,\tau^2) = g(\cdot;\sigma^2,\tau^2) * f(\cdot).$$

Then, the matrix $N$ can be obtained as

$$N = g(\cdot;\sigma^2,\tau^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix},$$

where $g$ is the Gaussian kernel function, $\sigma$ is the spatial factor, $\tau$ is the temporal factor, $f$ is the depth video image sequence, and $L_x$, $L_y$, and $L_t$ are the partial derivatives of $L$. The three eigenvalues of the matrix $N$, $\lambda_1$, $\lambda_2$, and $\lambda_3$, correspond to changes in the depth video sequence in the two spatial directions and in the temporal domain $t$, respectively. When these values are all large, the video changes significantly along all three directions, and the point is therefore a space-time interest point.

Laptev defined the response function of interest points as

$$H = \det(N) - k\,\mathrm{trace}^3(N),$$

where $\det(\cdot)$ and $\mathrm{trace}(\cdot)$ are the determinant and trace of the matrix, respectively, and $k$ is a coefficient that usually takes the value 0.005. The function value $H$ attains a local maximum at a point of interest.

3D Harris corner detection is very sensitive to movements that change the direction of velocity, such as walking, running, and waving, but for other movements, such as rotation and periodic movement, often no point of interest is detected.

The interest points detected by the 3D Harris spatiotemporal corner detector are too sparse. Although some sparsity is expected, feature points that are too sparse mean that there are too few underlying features.
This can negatively affect the recognition results. Dollar et al. [38] proposed a new method for interest point detection, which makes the extracted interest points denser. The response function $H$ is calculated by separable linear filters:

$$H = (f * g * h_{ev})^2 + (f * g * h_{od})^2,$$

where $f$ is the depth video sequence, $g(x, y; \sigma)$ is a 2D Gaussian smoothing kernel function for spatial filtering, and $h_{ev}$ and $h_{od}$ are the orthogonal components of a one-dimensional Gabor function, which are used for filtering in the time domain:

$$h_{ev}(t;\tau,\omega) = -\cos(2\pi t\omega)\,e^{-t^2/\tau^2}, \qquad h_{od}(t;\tau,\omega) = -\sin(2\pi t\omega)\,e^{-t^2/\tau^2},$$

where $\omega = 4/\tau$, so the response function $H$ has only two parameters, $\sigma$ and $\tau$, corresponding to the space and time scales, respectively. A point at which the response function $H$ has a local maximum is detected as a point of interest if the value is greater than a certain threshold. The number of interest points detected can be controlled by this threshold. To handle scale changes, interest points can be detected at a combination of multiple scales.

However, noise points in the depth image also produce a large response to the kernel function in the space-time domain, so they are mistakenly detected as points of interest. Wrong interest points introduce many errors into the subsequent feature description, which seriously reduces the descriptive ability of spatiotemporal interest points. In this paper, a correction filter is applied to the detected interest points to reduce noise interference.

The noise of a depth image can be roughly divided into three categories. The first is generated by the depth sensing equipment and appears randomly across the whole depth image. This kind of random noise is generally infrequent and has little influence on the detection of interest points. The second kind of noise appears at the edges of scene objects because of the nature of structured light imaging. The depth value at such noise points often jumps between the foreground and the background on both sides of the edge. The third is due to reflective material on the surface of an object, fast movement, and so on.
In this case, “holes” appear on the depth image, that is, depth values are lost (the pixel value is zero). The second and third kinds of noise interfere strongly with the detection of interest points, and they are difficult to remove by ordinary spatial smoothing filtering. Generally speaking, the disturbance frequency of a noise signal is much higher than the motion frequency of the human body, and it may appear in consecutive frames of a human motion time segment. Based on this, we can calculate the average time of noise disturbance and then filter the obtained interest points. The correction function of an interest point is defined in terms of $n$, the number of times the signal at the pixel jumps during the whole movement period, and $t_i$, the duration of the $i$th jump. The interest point correction function is the ratio of the accumulated jump duration at a pixel to the whole time period. It takes higher values at truly moving pixels and lower values at noise points. Therefore, pixels with a low ratio (noise points) can be filtered out by setting a threshold.

After the interest points are detected, appropriate local feature descriptors must be selected to represent them.

Step 2: feature description of interest points. Dollar et al. [38] proposed the concept of the cuboid for describing interest points. The cuboid is a box-shaped video block centered on an interest point, whose edge lengths generally depend on the detection scale of the interest point. Cuboid descriptors represent interest points together with their neighborhood information.

First, three kinds of transformations are performed on each cuboid: (1) pixel value normalization; (2) for each pixel, the gradient in the different directions is calculated, and three cuboid matrices are obtained; (3) the Lucas–Kanade optical flow [39] is calculated for adjacent frames, and two cuboid matrices are obtained.
Therefore, for each interest point in the set of extracted interest points, we can calculate its feature description from these transformed cuboid matrices.

Step 3: the establishment of the space-time codebook. Because performers differ in clothing, action style, and amplitude, the same action will produce different interest points in different videos. However, the features of these interest points are similar and provide the essential description of the temporal and spatial characteristics of the action. After the feature representation of interest points, we need to use the feature vectors to represent different actions, that is, to model the action. The most common way to model the interest points is the bag of video words (BoVW) method.

A k-means clustering algorithm is used to cluster the feature set extracted from the training dataset. The number of clustering centers is selected experimentally. Each generated clustering center is regarded as a spatiotemporal word $w = (f_1, f_2, \ldots, f_m)$, where $m$ is the feature dimension and $f_i$ is the $i$th feature component of the spatiotemporal word. The set of all spatiotemporal words is $W = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of clustering centers. For different action videos, the spatiotemporal codebook corresponding to the different action categories is trained on the training set according to the above steps. In the subsequent action recognition process, the interest points are classified by calculating the distance between the features of the interest points and the spatiotemporal words.

The statistical histogram of interest points based on the spatiotemporal codebook is obtained by counting the categories of all interest points in the video, where $n$ is equal to the dimension of the spatiotemporal codebook and $h_i$ is the frequency of the $i$th spatiotemporal word in the video. Finally, the histogram is used as the video descriptor:

$$\mathbf{h} = (h_1, h_2, \ldots, h_n).$$
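The histogram construction in Step 3 can be sketched as follows, assuming the codebook has already been produced by k-means. The sketch assumes Euclidean distance for word assignment and relative frequencies for $h_i$ (raw counts would be an equally valid reading of "frequency"); the function names are hypothetical.

```python
import math


def nearest_word(feature, codebook):
    """Index of the spatiotemporal word closest to the feature (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(feature, codebook[i]))


def bovw_histogram(features, codebook):
    """Relative frequency of each codebook word among a video's interest points."""
    counts = [0] * len(codebook)
    for feature in features:
        counts[nearest_word(feature, codebook)] += 1
    total = len(features)
    if total == 0:
        return [0.0] * len(codebook)  # a video with no interest points
    return [c / total for c in counts]
```

The returned list has one entry per spatiotemporal word, so its length equals the codebook size $n$ regardless of how many interest points the video contains.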

4.3. S-JRPF

Skeleton joint points are the visually salient points of the human body, and their movement in 4D space can reflect the semantic information of an action. Research on joint-based motion recognition can be traced back to Johansson's early work [40], whose experiments show that most movements can be identified from the positions of the joint points alone. This idea has been adopted by a large number of subsequent researchers and has gradually formed an important branch of human motion recognition methods.

With the release of the Microsoft Kinect sensor, it has become convenient to obtain the depth map of a scene and the 3D skeleton of the human body. Compared with features extracted from depth images, the 3D skeleton data provided by Kinect describe the human body with only 20 joint points. After feature extraction, the feature dimension is therefore lower and the computation smaller, which benefits the real-time performance of action recognition algorithms. For three-dimensional skeleton motion data, the motion must first be expressed through a feature representation before it can be correctly identified. Here, the coordinate information of the 20 joint points provided by the Kinect is used to build a good representation of the human body.

Based on the joints’ modal data, this paper presents the spatial distribution feature of joint projection to represent human motion. Firstly, the 3D skeleton data of each frame are collected and projected onto three planes (the XOY, YOZ, and XOZ planes) to obtain the position distribution of the projected joint points of a single frame on each projection plane. The projection of the joint points of the human body is shown in Figure 4.

Then, the joint points on the three projection planes are represented in polar coordinates:

Finally, the polar coordinates of the projection points on the three projection planes are concatenated as the feature vector of the frame. Since the skeleton modal information is invariant under translation, scale, and rotation transformations, the joints’ relative position feature can be obtained by min-max normalization, which makes the feature data fall in [0, 1]. Therefore, the feature view in the joints’ mode is the normalized concatenation of these polar coordinates.
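
A minimal sketch of the projection, polar conversion, and normalization steps is given here for illustration. Several details are assumptions rather than statements from the paper: the polar origin is taken to be the coordinate origin of each plane, the angle is computed with `atan2`, the per-frame layout is plane by plane, and the min-max normalization is applied to each feature dimension across a set of frames. The function names are hypothetical.

```python
import math


def project_to_planes(joint):
    """Project one 3D joint (x, y, z) onto the XOY, YOZ, and XOZ planes."""
    x, y, z = joint
    return [(x, y), (y, z), (x, z)]


def to_polar(u, v):
    """Polar coordinates (radius, angle) of a 2D point.

    Assumption: the pole is the origin of the projection plane.
    """
    return math.hypot(u, v), math.atan2(v, u)


def frame_feature(joints):
    """Concatenate the polar coordinates of all joints on the three planes."""
    feature = []
    for plane in range(3):
        for joint in joints:
            u, v = project_to_planes(joint)[plane]
            feature.extend(to_polar(u, v))
    return feature


def min_max_normalize_columns(frames):
    """Scale every feature dimension to [0, 1] across the given frames.

    Assumption: normalization is done per dimension over a set of frames;
    a constant dimension is mapped to 0.
    """
    columns = list(zip(*frames))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(frame, lows, highs)]
            for frame in frames]
```

With the 20 Kinect joints, each frame yields 3 planes × 20 joints × 2 polar components, that is, a 120-dimensional vector under this layout.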

5. Recognition Algorithm

Experiments show that the classification performance of an ensemble learning system is better than that of each of its base classifiers, which demonstrates the effectiveness of ensemble learning. Dietterich [41] listed ensemble learning as one of the top four research directions of machine learning. Ensemble learning aims to build a strong classifier with excellent classification performance and generalization ability. Among traditional classification algorithms, the SVM and KNN algorithms achieve better classification results than the others.

However, the classification performance of a single base classifier is not stable, and using a base classifier alone to classify the data easily leads to overfitting. Combining base classifiers according to a combination strategy produces a strong classifier whose classification performance is better than that of each base classifier. In order to obtain a better classification method, this paper builds an ensemble KNN multiclassifier model.

The KNN method is based on analogical learning and is a nonparametric classification technique. It is very effective in statistical pattern recognition. It can achieve high classification accuracy for data with unknown or non-normal distributions and has the advantages of robustness and conceptual clarity.

The basic ideas are as follows: feed in new data without a class label, extract the feature from the new data, and compare the new feature to the feature of each sample in the training set; then select the class labels of the k nearest (most similar) samples and count the number of the label occurrences. The class with the highest occurrence count is determined to be the class of the new data.
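
The basic idea can be sketched in a few lines. This is an illustrative majority-vote KNN only; the function name `knn_classify`, the use of the Euclidean distance, and the default k are assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def knn_classify(query, samples, labels, k=5):
    """Majority vote among the k training samples nearest to the query."""
    # Rank training samples by Euclidean distance to the query feature.
    order = sorted(range(len(samples)),
                   key=lambda i: math.dist(query, samples[i]))
    # Count the class labels of the k nearest samples.
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

For example, with three training features labeled "walk" clustered near the origin and three labeled "run" clustered near (5, 5), a query at (0.2, 0.2) with k = 3 is assigned to "walk".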

Now, we expect to use the KNN classification rule to correctly classify a test data point $x$. By finding the k nearest neighbors of the test sample point $x$, the point is predicted to belong to the category that contains the most of its k nearest neighbors. Among the N training samples, $N_1$ training samples belong to category $C_1$, $N_2$ training samples belong to category $C_2$, …, and $N_c$ training samples belong to category $C_c$. If $k_1, k_2, \ldots, k_c$ of the k nearest neighbors belong to categories $C_1, C_2, \ldots, C_c$, respectively, then the discriminant function can be defined as

$$g_i(x) = k_i, \quad i = 1, 2, \ldots, c.$$

The decision rule is: if

$$g_j(x) = \max_{i} k_i,$$

then $x \in C_j$.

To classify a specific action, we can search the training set for the K actions that are nearest to the new action and determine the class of the new action based on the classes of these K actions. This paper proposes an integrated classification method using multilearners based on a training set of multimodal features, which is more effective at identifying the new action. It fully utilizes the biasing effects of different learners and therefore enhances the generalization capability of the learning system. The implementation sequence of the algorithm is as follows:

Step 1: describe the training sets of action features with different modal information separately. This gives three training sample sets, one per modality, and the number of samples in each training set is N.

Step 2: determine the vector representation of the action in the three kinds of modal description. This gives three feature vector representations of the action Θ to be predicted, one per modality, each with its own sample dimension.

Step 3: select the Topk1, Topk2, and Topk3 actions that are nearest to the action to be predicted from the three training sets, using a different distance measure for each. The three measures are the Euclidean distance, the Manhattan distance, and the Mahalanobis distance, where $V^{-1}$ is the inverse of the covariance matrix.

Step 4: compute the weight of each class over the Topk1 + Topk2 + Topk3 actions that are nearest to the action to be predicted. The class weight is computed, for every selected neighbor in each modality, from an attribute function of the class and a weight coefficient. The attribute function takes 1 if the neighbor belongs to class $C_j$ and 0 otherwise. The weight coefficient of a nearest neighbor is the reciprocal of its distance to the action to be predicted, and ε is a small positive number.

Step 5: compare the class weights and assign the action to be predicted to the class with the largest weight.
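
Steps 1 to 5 can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: the weight of a neighbor is taken as 1 / (d + ε), the class weight is the sum of these weights over all selected neighbors, the inverse covariance matrix is supplied by the caller, and the pairing of a particular metric with a particular modality is left to the caller. All function and parameter names are hypothetical.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def make_mahalanobis(inv_cov):
    """Build a Mahalanobis distance from a precomputed inverse covariance."""
    def mahalanobis(a, b):
        diff = [x - y for x, y in zip(a, b)]
        n = len(diff)
        # Quadratic form (a - b)^T V^{-1} (a - b).
        quad = sum(diff[i] * inv_cov[i][j] * diff[j]
                   for i in range(n) for j in range(n))
        return math.sqrt(max(quad, 0.0))
    return mahalanobis


def ensemble_knn_predict(query_views, train_views, labels, metrics, top_ks,
                         eps=1e-6):
    """Weighted vote over the nearest neighbors found in each modality.

    query_views: one feature vector per modality for the action to predict.
    train_views: one list of N training vectors per modality (same order
                 of samples in every modality, so labels are shared).
    Assumption: neighbor weight = 1 / (distance + eps), summed per class.
    """
    class_weight = defaultdict(float)
    for query, train, metric, k in zip(query_views, train_views, metrics, top_ks):
        ranked = sorted((metric(query, sample), i) for i, sample in enumerate(train))
        for dist, i in ranked[:k]:
            class_weight[labels[i]] += 1.0 / (dist + eps)
    # Step 5: the class with the largest accumulated weight wins.
    return max(class_weight, key=class_weight.get)
```

Because the three metrics operate on features of different scale, the reciprocal-distance weights of the modalities are not directly comparable in this sketch; normalizing the features of each modality beforehand is one way to limit this effect.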

6. Experiments and Results

This section provides the experimental results and analysis of our algorithm as applied to the G3D dataset and Cornell Activity Dataset 60.

6.1. Figure of Merit

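Cross-validation is adopted in the experiments to train the classification model and to test its performance. In addition, the precision, recall, and F-measure are used to evaluate the effectiveness of the algorithm, as given in the following equations:

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$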

In binary classification, TP is the number of positive samples that are correctly predicted by the classification model, FP is the number of negative samples that are classified as positive by the model, and FN is the number of positive samples that are classified as negative by the model. These formulas can be extended to multiclass classification.
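
As an illustration of how these measures extend to the multiclass case, the sketch below computes them per class in a one-vs-rest manner and then averages over classes. The macro-averaging and the F-measure taken as the harmonic mean of precision and recall are assumptions; the paper does not state which multiclass extension is used. The function names are hypothetical.

```python
def precision_recall_f(y_true, y_pred, positive):
    """Precision, recall, and F-measure treating one class as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def macro_average(y_true, y_pred):
    """Unweighted mean of the per-class measures (assumed extension)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = [precision_recall_f(y_true, y_pred, c) for c in classes]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))
```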

6.2. Datasets

The G3D dataset contains 20 categories of human actions, each performed by 10 persons. The 20 action categories are punch right, punch left, kick right, kick left, defend, golf swing, tennis swing forehand, tennis swing backhand, tennis serve, throw bowling ball, aim and fire gun, walk, run, jump, climb, crouch, steer a car, wave, flap, and clap. The Cornell Activity Dataset 60 (CAD60) contains 12 actions, which are performed by 4 persons in 5 different environments. These actions are rinsing mouth, brushing teeth, wearing contact lens, talking on phone, drinking water, opening container, chopping, stirring, talking on couch, relaxing on couch, writing on white board, and working on computer. The actions in the G3D and CAD60 datasets are recorded in three different modalities: RGB image, depth image, and skeleton joint data, as illustrated in Figures 5 and 6. In the experiment, we randomly divide all videos into training data and test data at a ratio of 7 : 3, and the final test result is the average of 10 test results. We set Topk1, Topk2, and Topk3 to 5.
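
The evaluation protocol (a random 7 : 3 split, repeated 10 times and averaged) can be sketched as follows. The helper names, the fixed random seed, and the `evaluate` callback, which stands for any routine that trains on the first set and returns a score on the second, are hypothetical and introduced only for illustration.

```python
import random


def split_70_30(items, rng):
    """Shuffle the items and split them into 70% training and 30% test."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def mean_score(videos, evaluate, n_runs=10, seed=0):
    """Average the score of `evaluate(train, test)` over repeated splits."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    scores = []
    for _ in range(n_runs):
        train, test = split_70_30(videos, rng)
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```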

6.3. Experiments and Results

In this section, we validate the feasibility and efficiency of this paper’s method in two experiments. In the first, we test the recognition rate and the precision, recall, and F-measure on the G3D and CAD60 datasets based on a single feature and this paper’s algorithm. In the second experiment, we compare our method to other algorithms.

We present the results of this paper’s method in Experiment 1 as confusion matrices. The (i, j) element of the matrix is the percentage of actions of class i that are classified as class j. Therefore, the greater the diagonal elements, the better the classification result.
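
A minimal sketch of such a row-normalized confusion matrix is shown here; the function name and the convention that an empty row is reported as zeros are assumptions made for illustration.

```python
def confusion_matrix_percent(y_true, y_pred, classes):
    """Entry (i, j): percentage of class-i actions classified as class j."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Each row is normalized by the number of true samples of that class.
        matrix.append([100.0 * c / total if total else 0.0 for c in row])
    return matrix
```

For instance, if one of two "walk" videos is recognized as "run" and both "run" videos are recognized correctly, the matrix over the classes (walk, run) is [[50, 50], [0, 100]].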

Figures 7–10 illustrate the recognition rates using the single modal feature on the G3D dataset with confusion matrices. Figure 10 shows the recognition rate resulting from this paper’s method using multimodal information. From these figures, it can be observed that the recognition rates of all 20 action categories based on multimodal features are higher than those using the single modal feature. For the six actions of defend, throw bowling ball, aim and fire gun, wave, flap, and clap, the accuracy is 100%. Figures 11–14 illustrate the recognition rate using the single modal feature on the CAD60 dataset with confusion matrices. Figure 14 shows the recognition rate of this paper’s method using multimodal features on the CAD60 dataset. Through comparison, it can be found that this paper’s method achieves a good recognition rate of 94% on the CAD60 dataset, with 100% accuracy for the actions of drinking water, stirring, relaxing on couch, and writing on white board. The experimental results show that the integrated KNN model based on multimodal data is better than a single KNN model based on single modal data, which has difficulty meeting the needs of human behavior prediction.

In addition, Table 1 presents the precision, recall, and F-measure obtained using the single modal feature and the multimodal features. The recognition rates of this paper’s method using multimodal features are higher than those of the methods using the single modal feature.

In the second experiment, we compare this paper’s method with other classical machine learning methods. Table 2 compares this paper’s algorithm with boosting, bagging, support vector machine (SVM), and artificial neural networks (ANNs). The results in Table 2 show that the integrated multilearner recognition algorithm based on multimodal features achieves the highest recognition rate, 94%. The combined nearest neighbor classifier has better classification accuracy mainly because the proposed algorithm is a behavior recognition algorithm based on multimodal feature fusion, which can make full use of the complementarity between different models. In general, the accuracy of the combined nearest neighbor classifier based on multimodal data is higher than that of the original single nearest neighbor classifier.

Table 3 compares the average class accuracy of our method with results reported by other researchers. Compared with the existing traditional machine learning approaches, our method shows much better performance and outperforms the state-of-the-art approaches. Note that a precise comparison between the approaches is difficult, since the experimental setups, e.g., the training strategies, differ slightly between approaches. In addition, our method also achieves better results than the random dropout-based CNN method, which uses only RGB data. The dropout method sets the weights of some hidden layer nodes of the neural network to 0 during training, which addresses the overfitting caused by too few training samples. On the basis of dropout, we further add a layer of randomization to realize random dropout, so as to further prevent overfitting of the model. Therefore, when the training sample set is small, the multilearner recognition method based on multimodal features is better than the deep learning method.

7. Conclusion

A human action recognition method based on multimodal features is proposed in this paper. Through the Kinect sensor, information in three modalities is acquired for each image, and the RGB-HOG feature, D-STIP feature, and S-JRPF feature are extracted. An integrated learning strategy with multilearners is adopted, which fully utilizes the biasing effects of different learners. The method achieves good recognition rates on standard public datasets and is robust in real time. Although the method achieved good experimental results on public datasets, many issues in action recognition remain and call for deeper investigation. Generally, a large number of tagged video training samples is necessary for a classifier to achieve good generalization capability. This requires a lot of manual tagging work, and thus practical modeling can be difficult. A very valuable direction is therefore to investigate how to enhance the learning system’s performance by utilizing the abundant untagged video samples available in public data.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.


The abstract of the manuscript was already presented as conference proceedings in Global Intelligence Industry Conference (GIIC 2018).

Conflicts of Interest

The authors declare that they have no conflicts of interest.


This research was funded by the National Natural Science Foundation of China (grant nos. 61673249, 61672204, U1805263, and 61662025), the Natural Science Foundation of Anhui Province (grant no. 2008085MF202), the Guidance Project of Science and Technology of Xiamen (grant no. 3502Z20179038), the Research Foundation of Education Bureau of Hunan Province (grant no. 16C1311), the Natural Science Foundation of Zhejiang Province (grant nos. LY20F030006 and LY20F020011), the Natural Science Research Project of Universities of Anhui Province (grant no. KJ2019A1121), the Research Development Project Fund of Hefei University (grant no. 18zr19zda), the Key R&D Program of Shanxi Province (grant no. 201903D421050), and the Key Teaching and Research Project of Hefei University (grant no. 2018hfjyxm09).