Special Issue: Theory and Applications of Complex Cyber-Physical Interactions
Research on Discriminative Skeleton-Based Action Recognition in Spatiotemporal Fusion and Human-Robot Interaction
A novel posture motion-based spatiotemporal fused graph convolutional network (PM-STFGCN) is presented for skeleton-based action recognition. Existing skeleton-based methods compute the joint information within a single frame and the motion information of joints between adjacent frames independently from the human skeleton structure and then combine the classification results. This ignores the complicated temporal and spatial relationships within a human action sequence, so these methods are not very effective at distinguishing similar actions. In this work, we improve the ability to distinguish similar actions by focusing on spatiotemporal fusion and adaptive extraction of highly discriminative features. Firstly, the local posture motion-based temporal attention module (LPM-TAM) is proposed to suppress skeleton sequence data with little motion in the temporal domain and to concentrate the representation of motion posture features. Besides, the local posture motion-based channel attention module (LPM-CAM) is introduced to exploit the strongly discriminative representations that separate similar action classes. Finally, the posture motion-based spatiotemporal fusion (PM-STF) module is constructed, which fuses the spatiotemporal skeleton data by filtering out low-information sequences and adaptively enhances the highly discriminative posture motion features. Extensive experiments demonstrate that the proposed model outperforms commonly used action recognition methods. The human-robot interaction system designed on top of the action recognition model performs competitively with a speech interaction system.
1. Introduction
With the development of artificial intelligence technology, human-robot interaction has become a research hotspot. Compared with speech and image signals, vision-based human-robot interaction technology is more stable, and it attracts a lot of research interest. The key to human-centered visual interaction technology is to understand human activities and human social behaviors. Therefore, action recognition plays an important role in the field of human-robot interaction. The two main approaches to human action recognition are RGB-based and skeleton-based. The RGB-based method makes full use of the image data and can reach a higher recognition rate. However, it usually needs to process every pixel in the image to extract features, so it requires costly computing resources and can hardly run in real time. It is also vulnerable to poor lighting conditions and background noise. In the skeleton-based method, the human joint positions are expressed as 2D or 3D coordinates. Because the human skeleton has only a few dozen joints, modest computing resources are enough for real-time applications. The method is also robust to dynamic environments and complex backgrounds. Many widely available devices and pose estimators are suitable for extracting human skeleton features, such as Microsoft Kinect, OpenPose, and CPN.
Conventional deep learning-based methods either convert the skeleton sequence into a set of joint vector sequences and feed them to RNNs, or extract features by feeding 2D pseudoimages that represent skeleton sequences into CNNs, and then predict the action classes. However, neither the joint vector sequence nor the 2D pseudoimage can represent the correlation between human joints effectively. Recently, graph convolutional neural networks (GCNs) have extended the convolution operation from the 2D image structure to the graph structure and have shown good performance in many applications. Yan et al. used GCNs for the first time in skeleton-based action recognition and proposed a spatial-temporal graph model. Subsequently, methods for optimizing spatial feature extraction were proposed. Yang et al. presented a finite-time convergence adaptive fuzzy control method for a dual-arm robot with unknown kinematics and dynamics. Shi et al. used an adaptive graph convolutional layer and an attention mechanism to increase the flexibility of the model and took first-order joint information, second-order bone information, and motion information as inputs to construct multistream networks. Liu et al. proposed multiscale aggregation across the spatial and temporal dimensions to disentangle the importance of nodes in different neighborhoods for long-range modeling. Yang et al. proposed a personalized variable gain control with tremor attenuation for robot teleoperation and used adaptive parameter estimation and control design for robot manipulators with finite-time convergence. Peng et al. used high-order representations of skeleton adjacency graphs and dynamic graph modeling mechanisms to find implicit joint correlations. Obinata and Yamamoto modeled the spatiotemporal graph by adding extra interframe edges to extract the relevant features of the human joints. However, all these methods ignore the fusion of posture motion and skeleton joint features in the temporal domain.
In existing work, the spatial information and motion information of the spatiotemporal graph are not fused effectively for end-to-end training. The proposed posture motion-based spatiotemporal fusion graph convolutional network (PM-STFGCN) uses the posture motion-based spatiotemporal fusion (PM-STF) module to fuse motion and skeleton representations in the spatiotemporal domain and to enhance skeleton features adaptively. The local posture motion-based temporal attention module (LPM-TAM) constrains the disturbance information in the temporal domain and learns the representation of motion posture. The local posture motion-based channel attention module (LPM-CAM) learns strongly discriminative representations between similar action classes in order to improve the ability to distinguish ambiguous actions. Extensive experiments have been performed on two large-scale skeleton datasets. When combined with methods that optimize only the spatial graph convolution, the proposed method further improves their recognition performance. In addition, a human action recognition interactive system was designed and compared with speech interaction.
The main contributions of our method are the following:
(1) A novel local posture motion-based temporal attention module (LPM-TAM) filters out low-motion information in the temporal domain, which improves the extraction of relevant motion features.
(2) A local posture motion-based channel attention module (LPM-CAM) adaptively learns strongly discriminative representations between different action classes, which improves the ability to distinguish similar actions.
(3) The posture motion-based spatiotemporal fusion (PM-STF) module integrates LPM-TAM and LPM-CAM to fuse the spatiotemporal feature information effectively and to extract highly discriminative features for distinguishing similar actions.
(4) The effectiveness of the proposed method has been verified through extensive experiments and through comparison with other common methods. The method has also been applied successfully on a humanoid robot, where action interaction compared favorably with speech interaction.
2. Related Work
2.1. Spatial Graph Convolution Networks
The spatial-temporal graph convolutional neural network represents the connection relationship of the joints with the self-connection identity matrix $I$ and the adjacency matrix $A$. In the case of a single frame, the convolution operation in the spatial dimension is performed as follows:
$$f_{out} = \sum_{k=1}^{K} W_k\, f_{in} \left(A_k \odot M_k\right), \tag{1}$$
where $f_{in}$ is the input feature map, a $C \times T \times N$ tensor, $N$ is the number of joints, and $A_k = \Lambda_k^{-1/2} \bar{A}_k \Lambda_k^{-1/2}$. Here $\bar{A}_k$ is the adjacency-like matrix, whose element $\bar{A}_k^{ij}$ denotes whether the vertex $v_j$ is in the subset $S_{ik}$ of the vertex $v_i$, and $\Lambda_k$ is the normalized diagonal matrix, where $\Lambda_k^{ii} = \sum_j \bar{A}_k^{ij} + \alpha$ and $\alpha$ is a small constant that avoids empty rows. $K$ is the number of subsets in the spatial dimension under the spatial distance partition strategy. There are three subsets, namely, $S_{i1}$, $S_{i2}$, and $S_{i3}$: $S_{i1}$ represents the connection of the vertex itself, $S_{i2}$ represents the connection of the centripetal subset, and $S_{i3}$ represents the connection of the centrifugal subset. $W_k$ is the convolution weight. $M_k$ is a spatial attention feature map of dimension $N \times N$, which denotes the importance of each joint. $\odot$ is the multiplication of the corresponding elements of the matrices, which means that $M_k$ can only affect the vertices connected to the current target.
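The spatial convolution described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the tensor layout `(C, T, N)`, and the value `alpha=0.001` are assumptions made for the example, and the attention map is applied to the normalized adjacency matrix before aggregation.

```python
import numpy as np

def normalize_adjacency(a_bar, alpha=0.001):
    """Return Lambda^{-1/2} A_bar Lambda^{-1/2}, with Lambda_ii = sum_j A_bar_ij + alpha.

    alpha is an assumed small constant that keeps empty rows from dividing by zero.
    """
    deg = a_bar.sum(axis=1) + alpha
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_bar * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def spatial_graph_conv(f_in, a_bars, weights, masks):
    """Single-layer spatial graph convolution over K partition subsets.

    f_in:    (C_in, T, N) input feature map
    a_bars:  (K, N, N) adjacency-like matrices, one per subset
    weights: (K, C_out, C_in) 1x1 convolution weights
    masks:   (K, N, N) attention maps applied element-wise to the edges
    """
    c_out = weights.shape[1]
    _, t, n = f_in.shape
    f_out = np.zeros((c_out, t, n))
    for a_bar, w_k, m_k in zip(a_bars, weights, masks):
        a_k = normalize_adjacency(a_bar) * m_k        # masked, normalized edges
        agg = np.einsum("ctn,nm->ctm", f_in, a_k)     # aggregate neighbor features
        f_out += np.einsum("oc,ctm->otm", w_k, agg)   # mix channels (1x1 convolution)
    return f_out
```

Because the mask multiplies the adjacency matrix element-wise, an entry of the mask can only rescale an edge that already exists, which matches the remark that it affects only the vertices connected to the current target.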
2.2. Temporal Graph Convolutional Networks
The literature proposed a temporal attention module, in which the attention coefficient is calculated as follows:
$$M_t = \sigma\left(g\left(\mathrm{AvgPool}\left(f_{in}\right)\right)\right), \tag{2}$$
where $f_{in}$ is the input feature map, $\mathrm{AvgPool}$ is an average pooling operation, $g$ is a convolution operation with weight matrix $W_g$, where $S$ is the size of the convolution kernel, and $\sigma$ refers to the sigmoid activation function. The attention feature map $M_t \in \mathbb{R}^{1 \times T \times 1}$ denotes the importance of the skeleton graph along the temporal dimension, and $T$ refers to the length in time.
The literature defined the temporal graph convolution based on a simple strategy: on top of equation (1), a kernel of size $K_t \times 1$ is applied in the temporal dimension to perform the graph convolution. The sampling area of a vertex is therefore extended to the same joint in $K_t$ consecutive frames, where $K_t$ is the kernel size in the temporal dimension, which is set to 9 in that work.
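The temporal attention coefficient of Section 2.2 (average pooling, a 1D convolution along time, then a sigmoid) can be sketched as follows. The pooling axis (joints), the zero padding, and the kernel layout `(C, S)` are assumptions made for the example; the cited work may differ in these details.

```python
import numpy as np

def temporal_attention(f_in, kernel):
    """Per-frame attention in (0, 1) for a feature map.

    f_in:   (C, T, N) input feature map
    kernel: (C, S) weights of a 1-D convolution along time, S odd (assumed layout)
    """
    _, t, _ = f_in.shape
    s = kernel.shape[1]
    pooled = f_in.mean(axis=2)                              # average over joints -> (C, T)
    padded = np.pad(pooled, ((0, 0), (s // 2, s // 2)))     # zero padding keeps length T
    logits = np.array([np.sum(padded[:, i:i + s] * kernel) for i in range(t)])
    return 1.0 / (1.0 + np.exp(-logits))                    # sigmoid
```

Each output value depends on a window of `S` neighboring frames, so a frame is weighted according to its temporal context rather than in isolation.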
3. Posture Motion-Based Spatiotemporal Fusion Graph Convolution
3.1. Posture Motion Representation
The posture motion represents the motion information of the corresponding joint over a series of consecutive frames. For example, given the joint $i$ of frame $u$, i.e., $v_{ui}$, and the same joint of frame $u+1$, i.e., $v_{(u+1)i}$, the posture motion is represented as $m_{ui} = v_{(u+1)i} - v_{ui}$, where $m_{ui}$ is the posture motion representation of the joint $i$ of frame $u$.
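As a small illustration of this definition, the posture motion of a whole sequence is the frame-to-frame difference of the joint coordinates. The function name and the array layout `(T, N, D)` are assumptions for the example.

```python
import numpy as np

def posture_motion(joints):
    """Frame-to-frame joint displacement.

    joints: (T, N, D) array of T frames, N joints, D coordinates per joint.
    Returns a (T-1, N, D) array with m[u, i] = joints[u + 1, i] - joints[u, i].
    """
    joints = np.asarray(joints, dtype=float)
    return joints[1:] - joints[:-1]
```

A sequence of `T` frames yields `T - 1` motion frames; a network that needs equal lengths would have to pad one frame, which is left out of this sketch.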
3.2. Local Posture Motion-Based Temporal Attention Module
A novel local posture motion-based temporal attention module (LPM-TAM) is proposed to suppress the large amount of disturbance information in the temporal dimension. As shown in Figure 1, the posture motion feature map of each vertex in the spatiotemporal graph is calculated from the input feature map. Here $f_{in}^{u}$ is the input feature map at time $u$ and $W$ is the weight matrix of a convolution; the posture motion representation is extracted from the input feature map to give the posture motion feature map $f_m$, whose number of channels is half that of the input.
Human motion consists of body movements that involve some or all of the limbs. The attention map of the skeleton sequence in the spatiotemporal graph is represented by the attention of local limbs in the temporal dimension. The importance of a local limb in the temporal dimension is determined by the motion information in the local perception domain $D = \{d_1, d_2, d_3, d_4, d_5\}$, where $d_1$ denotes the left hand, $d_2$ denotes the right hand, $d_3$ denotes the left leg, $d_4$ denotes the right leg, and $d_5$ denotes the other limb parts. The limb index is $p \in \{1, \ldots, P\}$, where $P$ refers to the number of limbs and is set to 5 in this work. The local posture motion-based temporal attention $M_T$ gives the importance of each frame of the spatiotemporal graph with a time length of $T$. It is calculated from the motion feature map of the joints, $f_m$, and the local limb masks $d_p$ of the set $D$: the element-wise multiplication $\odot$ of the motion feature map with a limb mask gives the attention based on that local limb in the temporal dimension, and $\sigma$, the sigmoid activation function, maps the result to attention weights. The final output is as follows:
$$f_{out} = f_{in} \oplus \left(f_{in} \odot M_T\right). \tag{5}$$
The input feature map $f_{in}$ is multiplied by the attention feature map $M_T$ in a residual manner to achieve adaptive feature enhancement, and $\oplus$ refers to the addition of corresponding matrix elements.
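The limb-masked temporal attention and its residual application can be sketched as below. This is a hedged illustration only: the scoring rule (mean absolute motion per limb, averaged over limbs and centered over time) is an assumption chosen to make the example concrete and is not taken from the source; only the limb masks, the sigmoid, and the residual re-weighting follow the description above.

```python
import numpy as np

def lpm_temporal_attention(f_m, limb_masks):
    """Per-frame attention from limb-wise motion magnitude (assumed scoring rule).

    f_m:        (C_m, T, N) posture motion feature map
    limb_masks: (P, N) 0/1 masks, one row per limb (e.g. P = 5)
    """
    energy = np.abs(f_m).mean(axis=0)                 # (T, N) motion magnitude per joint
    counts = limb_masks.sum(axis=1)                   # number of joints in each limb
    per_limb = (energy @ limb_masks.T) / counts       # (T, P) mean motion per limb
    logits = per_limb.mean(axis=1)                    # assumed aggregation over limbs
    logits = logits - logits.mean()                   # assumed centering: quiet frames fall below 0.5
    return 1.0 / (1.0 + np.exp(-logits))              # sigmoid

def apply_temporal_attention(f_in, att):
    """Residual re-weighting: f_out = f_in + f_in * att, with att broadcast over channels and joints."""
    return f_in + f_in * att[None, :, None]
```

With the residual form, a frame with attention close to zero is passed through unchanged rather than erased, while frames with strong limb motion are amplified.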
3.3. Local Posture Motion-Based Channel Attention Module
The local posture motion-based channel attention module (LPM-CAM) is proposed to improve the ability to learn strongly discriminative representations of different postures. As shown in Figure 2, the input consists of the posture motion feature map extracted by the local posture motion-based temporal attention module and the generated temporal attention. The two are multiplied to obtain the spatial-temporal graph after attention, so that temporal action segments rich in action semantics receive more attention. The channel attention coefficient is calculated from this attended motion feature map.
The motion feature map after attention, denoted $f_m'$, is decomposed into several local limbs in the local perception field to represent local posture movement. The temporal attention module has already selected the action sequences with important semantics on the spatial-temporal graph, and the channel attention selects the strongly discriminative representations that separate different posture movements for action recognition. In the calculation, $\delta$ denotes the ReLU nonlinear activation function and $\mathrm{Concat}$ denotes the concatenation of the local limb feature maps. The final output is
$$f_{out} = f_{in} \oplus \left(f_{in} \odot M_C\right), \tag{7}$$
where $M_C$ is the channel attention feature map. The input feature map is multiplied by the channel attention feature map in a residual connection to achieve adaptive feature enhancement.
3.4. Posture Motion-Based Spatiotemporal Fusion
In order to fuse skeleton joint information and motion features in an end-to-end manner, the posture motion-based spatiotemporal fusion module (PM-STF) is proposed to fuse spatial and temporal features and to enhance the discriminative features adaptively. The output of the temporal convolution module at the vertex $v_{ui}$ of frame $u$ is
$$f_{out}(v_{ui}) = f_{in}(v_{ui}) + \sum_{v_{uj} \in B(v_{ui})} f_m(v_{uj})\, w\left(l_{ui}(v_{uj})\right). \tag{8}$$
This differs from formula (1) in that the input is a posture motion feature map extracted from the spatiotemporal graph and a residual connection is adopted to enhance the motion feature. $f_m(v_{uj})$ is the posture motion feature of the neighborhood vertex $v_{uj}$ in the neighborhood $B(v_{ui})$ of $v_{ui}$. $w$ is the weighting function. $l_{ui}(v_{uj})$ refers to the mapping label of the subset of the neighborhood vertex $v_{uj}$, which is divided into three subsets $S_{i1}$, $S_{i2}$, and $S_{i3}$ based on the spatial distance partition strategy.
To implement the PM-STF, equation (8) is transformed into
$$f_{out} = f_{in} + \sum_{k=1}^{K} W_k \left(W_m f_m\right) \left(A_k \odot M_k\right), \tag{9}$$
where $f_m$ is the posture motion feature map and $W_m$ is a convolution weight matrix that increases the number of channels of the posture motion feature map to match the input channels. $W_k$ is a 1 × 1 convolution weight vector. $M_k$ is a spatial attention map which is used to distinguish the importance of vertices. $\odot$ refers to the multiplication of the corresponding elements of the matrices. $A_k$ is the adjacency-like matrix, and $A_k = \Lambda_k^{-1/2} \bar{A}_k \Lambda_k^{-1/2}$.
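A compact NumPy sketch of the fusion step follows. It assumes one plausible reading of the module: the posture motion feature map is lifted back to the input channel count, aggregated over the partition subsets with masked adjacency matrices, and added to the input through a residual connection. The function name, argument layout, and the exact placement of the residual term are assumptions for illustration.

```python
import numpy as np

def pm_stf(f_in, f_m, w_m, w_k, a_k, m_k):
    """Fuse posture motion features into the skeleton features (illustrative reading).

    f_in: (C, T, N) skeleton feature map
    f_m:  (C_m, T, N) posture motion feature map
    w_m:  (C, C_m) weights that restore the channel count to C
    w_k:  (K, C, C) 1x1 convolution weights, one per subset
    a_k:  (K, N, N) normalized adjacency-like matrices
    m_k:  (K, N, N) spatial attention maps
    """
    lifted = np.einsum("oc,ctn->otn", w_m, f_m)          # (C, T, N)
    fused = np.zeros_like(f_in, dtype=float)
    for w, a, m in zip(w_k, a_k, m_k):
        agg = np.einsum("ctn,nm->ctm", lifted, a * m)    # masked neighbor aggregation
        fused += np.einsum("oc,ctm->otm", w, agg)        # channel mixing per subset
    return f_in + fused                                  # residual connection
```

When the motion feature map is all zeros, the output equals the input, which is the behavior a residual enhancement should have for a motionless sequence.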
3.5. Implementation of PM-STFGCN
Our module is combined with models that optimize only the spatial graph convolution, such as ST-GCN and 2s-AGCN. Taking ST-GCN as an example, as shown in Figure 3, the PM-STFGCN module is added between S-GCN and T-GCN. Each PM-STFGCN layer contains LPM-TAM, LPM-CAM, and PM-STF. S-GCN and T-GCN denote the spatial graph convolution layer and the temporal graph convolution layer of the original model, GAP is a global average pooling layer, and FCN is a fully connected network layer. Together these layers form a spatiotemporal fusion graph convolution block, and the overall network consists of several such blocks. A batch normalization layer is added at the skeleton data input to normalize the input data. Finally, the global average pooling layer pools the feature maps to the same size, and a SoftMax classifier follows to obtain the prediction.
3.6. Implementation in Human-Robot Interaction
The presented action recognition schemes were applied on a real system, which consists of a Pepper robot and an external Kinect v2 depth camera. The implementation in human-robot interaction was performed as follows (Algorithm 1).
4. Experiments
4.1. Datasets
4.1.1. NTU-RGB + D
NTU-RGB + D is the largest and most widely used multimodality dataset for skeleton-based action recognition. Each action segment was performed by 40 volunteers aged 10 to 35 and captured by three camera sensors at the same height but from different horizontal angles: −45°, 0°, and 45°. The human skeleton has 25 joint points, and the number of skeletons in each video is no more than 2. The dataset contains 60 action classes and 56880 video clips. Its authors recommend two training benchmarks: cross-subject (CS) and cross-view (CV). In the cross-subject (CS) benchmark, the training dataset contains 40320 action samples, and the testing dataset contains 16560 action samples. In the cross-view (CV) benchmark, the training dataset contains 37920 action samples taken by camera sensors 2 and 3, and the testing dataset contains 18960 action samples taken by camera sensor 1. In the following experiments, we report the top-1 accuracy on both benchmarks.
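The split sizes quoted above can be checked with a few lines of arithmetic. The dictionary layout and function name are only illustrative; the numbers are those given in the text.

```python
TOTAL_CLIPS = 56880

SPLITS = {
    "cross-subject": {"train": 40320, "test": 16560},
    "cross-view": {"train": 37920, "test": 18960},
}

def train_fractions(splits, total):
    """Check that each benchmark covers all clips and return its training fraction."""
    fractions = {}
    for name, split in splits.items():
        if split["train"] + split["test"] != total:
            raise ValueError(f"{name} split does not sum to {total}")
        fractions[name] = split["train"] / total
    return fractions

fractions = train_fractions(SPLITS, TOTAL_CLIPS)
```

Both benchmarks partition all 56880 clips. The cross-view split trains on exactly two thirds of the data, which is consistent with two of the three cameras being used for training.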
4.1.2. Kinetics-Skeleton
Kinetics-Skeleton is a large dataset for skeleton-based action recognition. Kinetics contains 300000 action video clips in a total of 400 classes. The skeleton data were obtained with the publicly available OpenPose toolbox, which estimates the pose of 18 joints in each frame of a clip. Each action clip has a total of 300 frames. In multiperson clips, the two people with the highest average joint confidence are selected in each frame. The training dataset contains 240000 video clips, and the testing dataset contains 20000 video clips. We train on the training dataset and report the top-1 and top-5 accuracy on the testing dataset.
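For reference, top-1 and top-5 accuracy can be computed from per-class scores as in the following sketch. The function name and the toy scores are illustrative and are not results from the paper.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-class score lists, one per sample
    labels: list of true class indices
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
top1 = top_k_accuracy(toy_scores, toy_labels, 1)
top2 = top_k_accuracy(toy_scores, toy_labels, 2)
```

Top-5 accuracy is informative on a 400-class dataset such as Kinetics-Skeleton because many classes are visually close, so the correct class often appears among the first few predictions even when it is not ranked first.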
4.2. Ablation Study
The effectiveness of the proposed method has been verified on two large skeleton datasets, Kinetics-Skeleton and NTU-RGB + D. The combination of the local posture motion-based temporal attention module (LPM-TAM), the local posture motion-based channel attention module (LPM-CAM), and the posture motion-based spatiotemporal fusion (PM-STF) module is denoted PM-STFGCN. Two sets of comparisons are made: between ST-GCN and ST-GCN + PM-STFGCN, and between 2s-AGCN and 2s-AGCN + PM-STFGCN. The results show that performance improves over the original models, which verifies the effectiveness of LPM-TAM, LPM-CAM, and PM-STF.
As shown in Tables 1 and 2, for ST-GCN and ST-GCN + PM-STFGCN, PM-STFGCN improves the top-1 accuracy of the CS and CV benchmarks by 4.2% and 1.6%, respectively, and the top-1 and top-5 accuracy on the Kinetics-Skeleton dataset by 2.5% and 1.9%, respectively. For 2s-AGCN and 2s-AGCN + PM-STFGCN, PM-STFGCN improves the top-1 accuracy of the CS and CV benchmarks by 3.3% and 1.3%, respectively, and the top-1 and top-5 accuracy on the Kinetics-Skeleton dataset by 1.4% and 2.0%, respectively. 2s-AGCN + PM-STFGCN performed best on both the NTU-RGB + D and Kinetics-Skeleton datasets.
4.2.1. Attention Module
Experiments were also performed to verify the local posture motion-based temporal attention module (LPM-TAM) and channel attention module (LPM-CAM). In the ST-GCN and 2s-AGCN networks, only the LPM-TAM or only the LPM-CAM is added to the convolutional layer of the spatial-temporal graph. The results are shown in Tables 1 and 2. Compared with ST-GCN, the LPM-TAM module improves the top-1 accuracy of the CS and CV benchmarks by 2.6% and 0.6%, respectively, and the top-1 and top-5 accuracy on Kinetics-Skeleton by 1.0% and 0.9%, respectively. The LPM-CAM module improves the top-1 accuracy of the CS and CV benchmarks by 3.0% and 0.7%, respectively, and the top-1 and top-5 accuracy on Kinetics-Skeleton by 1.4% and 1.1%, respectively. Compared with 2s-AGCN, the LPM-TAM module improves the top-1 accuracy of the CS and CV benchmarks by 1.7% and 0.5%, respectively, and the top-1 and top-5 accuracy on Kinetics-Skeleton by 0.4% and 0.9%, respectively. The LPM-CAM module improves the top-1 accuracy of the CS and CV benchmarks by 2.1% and 0.7%, respectively, and the top-1 and top-5 accuracy on Kinetics-Skeleton by 0.5% and 1.1%, respectively. Both the temporal and the channel attention module improve recognition performance over the original models, which verifies the effectiveness of the attention modules.
4.2.2. Spatiotemporal Fusion Module
Experiments were also carried out on the posture motion-based spatiotemporal fusion (PM-STF) module. In the ST-GCN and 2s-AGCN networks, only the PM-STF is added to the convolutional layer of the spatial-temporal graph. The results are shown in Tables 1 and 2. Compared with ST-GCN, the PM-STF module improves the top-1 accuracy of the CS and CV benchmarks by 3.2% and 0.9%, respectively, and the top-1 and top-5 accuracy on Kinetics-Skeleton by 1.6% and 1.2%, respectively. Compared with 2s-AGCN, the PM-STF module improves the top-1 accuracy of the CS and CV benchmarks by 2.5% and 0.7%, respectively, and the top-1 and top-5 accuracy on Kinetics-Skeleton by 0.8% and 1.3%, respectively. On every benchmark, these gains over the original model are at least as large as those of either attention module alone, which verifies the effectiveness and necessity of spatiotemporal fusion.
4.3. Comparison with State-of-the-Art Schemes
The proposed method is compared with some of the state-of-the-art schemes, and the results are shown in Tables 3 and 4. Among them, 2s-AGCN + PM-STFGCN achieved very good performance on CS and CV. On the Kinetics-Skeleton dataset, the accuracy of top-1 and top-5 of 2s-AGCN + PM-STFGCN also showed decent performance.
4.4. Human-Robot Interaction Demonstration
To further evaluate the robustness of the proposed action recognition schemes to distinguish similar action classes, action recognition is applied to a real system that consists of a Pepper robot and an external Kinect v2. As shown in Table 5, there is a correspondence between action semantics and interactive action.
The designed correspondence between action semantics and interactive activities ranges from local movements of the hands to more complex whole-body movements. For example, waving the hand, touching the ear, holding the head with hands, and applauding are all hand movements. Among them, the first three bring the hand close to the head and are therefore highly similar. Likewise, squatting, sitting down, and jumping are highly similar whole-body movements.
As shown in Figure 4, the measurement results are obtained in this work after conducting 50 experiments. Among them, each similar action has a high recognition accuracy which means our method can effectively distinguish each different action. Each action sequence can be seen as a combination of many steps. For example, waving the hand can be divided into two steps: first, raise your right hand to above the head; second, swing the hand around the head. Similarly, a video can be decomposed into multiple frames of images.
4.4.1. Strong Discrimination Analysis
Figures 5–8 show examples of human-robot interaction with similar actions. The action sequence over a period of time is classified, and the class with the highest probability is selected as the recognized result. Skeleton sequences with little motion information are filtered out well by LPM-TAM, which helps to identify the process of raising the hand to the head and swinging it, and to recognize interactive actions more purposefully. The main action of touching the ear is the process of raising the hand to the ear. Compared with waving the hand, the main difference is the swinging movement of the hand near the head. LPM-CAM pays more attention to such strongly discriminative characteristics and constrains the similar movement processes, such as raising the hand, which serves as the basis for action recognition. The action of holding the head with both hands is similar to touching the ears. However, holding the head involves the movement of both the left and the right hand, while touching the ear involves the movement of one hand only. LPM-CAM captures this main difference between similar movements in the local limb area effectively, which enables the proposed method to extract stronger and more discriminative representations. In the human-robot interaction experiments, similar actions did not affect the recognition result, which shows that the method discriminates strongly between similar actions.
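The selection rule described above, taking the class with the highest probability for a window of frames, can be sketched as follows. The label list reuses the actions named in the text; the function name and the probability values are hypothetical.

```python
ACTIONS = [
    "waving the hand",
    "touching the ear",
    "holding the head with hands",
    "applauding",
    "squatting",
    "sitting down",
    "jumping",
]

def recognize(window_probs, labels=ACTIONS):
    """Return the (label, probability) pair with the highest probability for one window."""
    if len(window_probs) != len(labels):
        raise ValueError("one probability per label is required")
    best = max(range(len(labels)), key=lambda i: window_probs[i])
    return labels[best], window_probs[best]

# Hypothetical classifier output for one window of skeleton frames.
label, prob = recognize([0.05, 0.60, 0.20, 0.05, 0.04, 0.03, 0.03])
```

In this toy window, the two head-related actions share most of the probability mass, and the larger one decides the interaction that the robot triggers.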
4.4.2. Comparison with Speech Interaction
In this work, action interaction is compared with speech interaction on two indicators: accuracy and real-time performance. Each interaction method was tested 50 times and the number of correct recognitions was recorded to verify the reliability of the action interaction. Figure 9 shows the confusion matrix of the Pepper robot speech interaction recognition. In the testing phase, the user only needs to speak the name of the corresponding action, such as "wave the hand" or "touch the ear." The recognition result is regarded as "jumping" if the speech has not been recognized within the specified test time. The result of speech interaction is easily affected by external noise and distance, which cause recognition errors or no recognition result at all. In the experiments, the average recognition rates of action interaction and speech interaction are 95.7% and 94.8%, respectively. The proposed scheme is therefore highly competitive with speech interaction, which verifies the reliability of action interaction in terms of recognition.
4.4.3. Comparison of Response Time
Figure 10 compares the response time of speech and action interaction, showing the average over 10 tests. Because the actions differ in duration, using time segments of the same length as input causes fluctuations in response time, so the experiments were repeated to average out the differences among the actions. The results show that the response time of action interaction is shorter than that of speech interaction because of its robustness to external environment noise. The average response times of speech and action interaction are 2.05 s and 1.86 s, respectively, so the proposed scheme reduces the response time by 0.19 s. The main reasons are that only the video frames within a certain time range are used for recognition and that the action recognition network has a shorter processing time.
In conclusion, the experimental comparison of the two human-robot interaction methods shows that action recognition has two advantages: it is not affected by environmental noise or spatial distance, and it provides a faster response during the interaction.
5. Conclusions
Previous works in the literature mostly model motion information and skeleton joint information independently, which cannot fully express the relationship between them. The posture motion-based spatiotemporal fusion graph convolution network (PM-STFGCN) is presented to fuse temporal and spatial features and to adaptively enhance the highly discriminative posture motion features. A novel local posture motion-based temporal attention module (LPM-TAM) is introduced to suppress low-motion disturbance information in the temporal domain efficiently and to learn the representation of the posture motion fully. The local posture motion-based channel attention module (LPM-CAM) is proposed to learn strongly discriminative representations between different motion postures, which improves the ability to discriminate action classes, and the posture motion-based spatiotemporal fusion module (PM-STF) is adopted to fuse the motion feature and the skeleton representation effectively. Extensive experiments were performed on two large skeleton datasets, and the constructed scheme shows substantial improvement over the compared methods. The proposed action recognition interaction system is competitive with speech interaction in both accuracy and response time.
Data Availability
The data used to support the findings of this study are included with the supplementary information files.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This material is based upon work funded by the State Key Laboratory for Manufacturing Systems Engineering, Xi’an Jiaotong University Foundation of China, under grant no. sklms2019011, “13th Five-Year Plan” Talent Training Project of Higher Education in Zhejiang Province under grant no. jg20190487, and Research Project of Educational Science Planning in Zhejiang Province under grant no. 2020SCG090.
References
K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” 2014.
T. Bagautdinov, A. Alahi, F. Fleuret, P. Fua, and S. Savarese, “Social scene understanding: end-to-end multi-person action localization and collective activity recognition,” in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 2017.