Abstract

In skeleton-based human action recognition methods, human behaviours can be analysed through temporal and spatial changes in the human skeleton. Skeletons are not affected by clothing changes, lighting conditions, or complex backgrounds, so this recognition approach is robust and has attracted great interest. However, many existing studies use deep networks with large numbers of parameters to improve model performance and thus lose the low-computation advantage of skeleton data, making it difficult to deploy such models in real-life applications based on low-cost embedded devices. To obtain a model with fewer parameters and higher accuracy, this study designed a lightweight frame-level joints adaptive graph convolutional network (FLAGCN) for skeleton-based action recognition. Compared with the classical 2s-AGCN model, the new model achieves higher precision with 1/8 of the parameters and 1/9 of the floating-point operations (FLOPs). The proposed network features three main improvements. First, an early feature-fusion method replaces the multistream network and reduces the number of required parameters. Second, at the spatial level, two kinds of graph convolution capture different aspects of human action information: a frame-level graph convolution constructs a human topological structure for each data frame, whereas an adjacency graph convolution captures the characteristics of adjacent joints. Third, the proposed model hierarchically extracts different levels of action sequence features, making the model clear and easy to understand while reducing its depth and number of parameters. Extensive experiments on the NTU RGB + D 60 and 120 data sets show that this method requires few parameters, has low computational costs, and runs quickly. It also has a simple structure and training process, making it easy to deploy in real-time recognition systems based on low-cost embedded devices.

1. Introduction

Human action recognition can be used in various scenarios, such as video retrieval and human-computer interaction [1], so it has been widely discussed in the literature. However, the diversity and complexity of human behaviours pose great challenges to the task of human action recognition. Biological research has shown that even in the absence of appearance information, it is possible to distinguish among action categories by analysing joint movements [2]. Skeleton data comprise the three-dimensional positions of several key joints in the human body and characterise rich depth information [3]. These data are not affected by the clothes worn by subjects, the lighting conditions, or environmental noise; they are highly robust and can express high-level human movement characteristics. With the advent of 3D cameras, such as Kinect cameras [4], skeleton data have become easy to obtain, and action recognition studies based on skeleton data have attracted more attention and made great progress [5], becoming an important branch of human action recognition research.

In the initial stage, due to the limitations of the available data sets, skeleton-based human action recognition researchers mainly used manual feature-extraction and machine-learning methods. Since Shahroudy et al. [6, 7] established the NTU RGB + D data set, a large-scale data set for 3D human activity analysis, deep learning has been widely used in skeleton-based human action recognition studies. The existing research is divided mainly into two directions: models based on convolutional neural networks (CNNs) [8–12] and models based on recurrent neural networks (RNNs) [13–18]. CNN-based methods regard the X, Y, and Z coordinates of joints as image channels, whereas the frame number and joint number of each action sequence are regarded as the length and width of the corresponding image, respectively. RNN-based methods consider the time-series characteristics of human behaviour and use RNNs to model these behaviours over time.

However, skeletons are non-Euclidean structural data in which joints are unordered. Different joints have different neighbouring nodes in the human skeleton. If joints are input into a convolutional network sequentially, the joints that are adjacent in the input may not be adjacent in the real human skeleton. Therefore, it is difficult to extract local and global joint features using traditional convolution methods. Recurrent neural networks carry out only temporal modelling and cannot fully express the spatial information of joints in skeletons. The graph convolutional network (GCN) is a new type of convolutional neural network. Yan et al. first applied a graph convolution method to skeleton-based human action recognition [19] and proposed the spatiotemporal graph convolutional network (ST-GCN). The ST-GCN model constructs the spatial structure of the human skeleton according to the adjacency between joints in the human body, significantly improving recognition performance and demonstrating the applicability and superiority of the GCN for this task. Graph convolution has gradually become a mainstream research method for skeleton-based recognition, and researchers have carried out extensive work based on the idea of graph convolution [20–32]. Combining graph convolution with excellent network structures, such as attention networks [33, 34] or residual networks [35, 36], can further improve recognition accuracy.

Current mainstream research based on the ST-GCN improves the recognition accuracy of skeleton-based recognition tasks through multistream inputs [37], additional optimization modules, improved loss functions [38], improved convolution kernels [24, 39], and attention mechanisms [34]. These methods make the network deeper and the structure of each layer more complex; they often introduce many parameters and difficult training processes and frequently require substantial computing resources and long training times. Additionally, these methods not only place high demands on the computing performance of the utilised equipment but also take a long time to predict action sequences in practical applications. Therefore, it is difficult to apply these models to real-time recognition applications based on low-cost embedded devices.

To solve the problems described above, this study proposes a lightweight hierarchical model called the frame-level joints adaptive graph convolutional network (FLAGCN). The hierarchical model consists of four parts: the data-processing level, point level, spatial level, and temporal level. There are six core layers, namely, the coordinate embedding layer, three frame-level joints adaptive graph convolutional (FLAGC) layers, and two CNN layers. The FLAGCN not only ensures a high recognition accuracy but also greatly reduces the required parameters and computational complexity of the model, thus reducing the training and prediction times and providing a solution for building real-time recognition systems. The main contributions of this study are as follows.

Three mainstream features (bones and the relative positions and motions of joints) are acquired and fused early in the modelling process, replacing the traditional multistream network. The model inputs thus retain useful discriminative information while the required training parameters and computational costs are reduced. In addition, feature generation is integrated into the model, avoiding a separate feature-generation step before training.

The proposed model uses a three-layer frame-level joints adaptive graph convolution method to capture human motion information from two aspects: a frame-level graph convolution and an adjacent graph convolution. The frame-level graph convolution method adaptively constructs different graphs for each data frame of each action sequence and captures the spatial characteristics of each frame. The adjacency graph convolution method uses a predefined adjacency matrix to capture the relationships between adjacent joints and fully utilises the prior information characterising the human skeleton. The combination of these two graph convolution methods improves the ability of the proposed model to extract spatial features.

In this study, the features of skeleton sequences are extracted hierarchically. This contrasts with spatiotemporal graph convolution layers, in which spatial and temporal features are extracted at every layer. In the proposed model, the three-dimensional coordinate features of joints are extracted at the point level, the spatial features of all joints in each frame are extracted at the spatial level, and the temporal features of the whole sequence are extracted at the temporal level. Therefore, the model is simple, clear, and easy to understand. The ablation experiments confirm that this layered feature-extraction process can effectively improve skeleton recognition accuracy with a small number of parameters.

2. Related Work

2.1. Skeleton-Based Action Recognition

In traditional methods, machine learning is used to solve human action recognition tasks based on human skeletons. For example, Vemulapalli et al. [40] used a combination of dynamic time warping, the Fourier temporal pyramid, and a linear support vector machine (SVM) to classify skeletons. Zanfir et al. [41] expressed each action by its associated joint velocities and accelerations in key frames and classified the actions using an improved k-nearest neighbour (KNN) classifier integrated with global time information. Deep learning methods have continued to progress, showing excellent data-processing abilities and enabling breakthroughs in computer vision and natural language processing. With the emergence of large data sets [6, 7], deep learning has also been used for skeleton-based human action recognition. For example, Li et al. [8] proposed a hierarchical co-occurrence feature-learning framework based on the global aggregation capability of CNNs; they learned the point-level features of each joint independently and fused the motion features with a two-stream framework. Nie et al. [9] proposed two descriptors and input them into a CNN. Pan et al. [17] constructed a dual-stream long short-term memory (LSTM) network to extract multilevel pose and trajectory features. Zheng et al. [18] introduced a recurrent relational network and designed an organic framework to simultaneously model the spatial configuration and temporal dynamics of joints. These works achieved improved performance compared with previously utilised methods.

2.2. Graph Convolutional Network

Graph convolutional networks, which have arisen as a new network form in recent years, show advantages in unstructured data processing and are widely used in traffic flow predictions, network node classifications, and molecular activity predictions in biochemistry [42]. Inspired by these advantages, Yan et al. [19] proposed a spatiotemporal graph convolution method in which every joint in the human skeleton corresponds to every node in a skeleton graph, and the connections between joints are defined as the edges of the skeleton graph. There are two types of edges in action sequences. Spatial edges refer to natural joint connections; these edges are thus predefined by an adjacency matrix characterising human joints. Temporal edges refer to the virtual connections of the same joints between adjacent frames and are simulated by the selected temporal convolution method. Shi et al. [20] proposed a dual-stream, adaptive, graph convolutional network (2s-AGCN) that trains and updates the skeleton graph structure together with the convolutional parameters of the model. This data-driven method improves the flexibility of the resulting graph. At the same time, to utilise the second-order information (the lengths and directions of bones) of the skeleton data, the model adds bones as inputs in another stream. The lengths and directions of the bones are expressed as vectors pointing from the source joints to the target joints. This method compensates for the shortcomings of the ST-GCN predefined graph, such as its lack of flexibility and inclusion of only first-order information, and achieves a better recognition effect. In recent years, some researchers have devoted themselves to optimizing the structure of the skeleton graph to improve the utilised networks based on graph convolution methods, whereas others have combined additional theories with graph convolution. For example, the dynamic framework proposed by Ye et al. [27] takes advantage of both GCNs and CNNs. The shift GCN designed by Cheng et al. [28] is composed of a spatial-shift graph convolution method and a temporal-shift graph convolution method, and its computation costs are greatly lowered. Si et al. [36] proposed the attention-enhanced graph convolutional LSTM (AGC-LSTM) network, representing the first attempt to combine graph convolution with LSTM for the task of human action recognition. Zhao et al. [43] combined graph convolution with LSTM and further extended the network to a probability model following a Bayesian framework. Peng et al. [31] constructed a graph convolutional network using a neural architecture search.

Some of the methods mentioned above require large numbers of parameters, deep networks, or heavy computation. If such a model is applied to low-cost embedded devices with limited memory or computing power, it is difficult to ensure good real-time recognition performance. The frame-level adaptive graph convolutional model proposed in this study combines the advantages of frame-adaptive graphs and adjacency matrices to extract spatial features and uses a simple network and a lightweight model to achieve high-precision recognition of human actions based on skeletons. The model can be deployed on such embedded devices at a small cost.

3. Methodologies

In this section, the proposed model is introduced in three parts. The first part describes the feature fusion at the point level. The second part introduces the details of the frame-level, adaptive, graph convolutional layer used in the spatial layer, focusing on two graph convolutional mechanisms. In the third part, we analyse the proposed hierarchical feature-extraction model and introduce the data-processing level and temporal level.

3.1. Point Level: Early Feature Fusion

Although neural networks can autonomously learn data features, many studies have indicated that early feature processing can improve model performance, so it is necessary to select distinctive features [44–48]. For example, inspired by the Lie group-based skeleton descriptor [44], Jiang et al. [16] proposed a spatiotemporal skeleton transformation descriptor (ST-STD) to define the relative transformations of skeleton poses, including rotation and translation during skeleton movement. Ahad et al. [45] used the linear joint position feature (LJPF) and angular joint position feature (AJPF), obtained from the three-dimensional linear joint positions and the angles between skeleton segments, as distinctive features. Nie et al. [9] proposed two new viewpoint-invariant motion features: the Euler angles of joints (JEAs) and the Euclidean distance matrix between joints (JEDM). Li et al. [23] chose a total of six data modalities (joints, bones, their motions, and their relative positions) and independently fed these modalities into the network through a six-stream input.

Bones and the relative positions and motions of joints have since become common features in skeleton-based action recognition because they are easy to obtain and highly discriminative [18, 21, 23, 26, 29, 30, 44]. Therefore, we first generate these three features at the data-processing level of the model.

The relative position of a joint is obtained by subtracting the coordinates of the central joint from the coordinates of any other joint. This value can be calculated using equation (1), where a is an arbitrary joint and c is the central joint. Because the distances and angles between the skeleton and the observation point are uncertain, the relative positions of joints reduce the influence of position changes among people and observation points. If the centre of each frame were subtracted from the joints of that frame, the motion information of the central joint would be lost; considering this, we use the middle of the spine in the first frame of a given action sequence as the central joint.
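
A plausible form of equation (1), consistent with this description (the symbols $x_{a,t}$, $x_{c,1}$, and $p_{a,t}$ are chosen here only for illustration), is

$p_{a,t} = x_{a,t} - x_{c,1},$

where $x_{a,t}$ is the three-dimensional coordinate of joint $a$ in frame $t$ and $x_{c,1}$ is the coordinate of the spine-centre joint $c$ in the first frame.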

Bones refer to the edge vectors formed by the natural connections within the human body. In our model, 25 bone vectors defined in 2s-AGCN [20] are used. Each bone is calculated by the vector difference between the two joints constituting the bone, as shown in equation (2), where t is the target joint node and s is the source joint node.
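
A plausible form of equation (2), with $x_s$ and $x_t$ denoting the coordinates of the source and target joints (notation chosen here for illustration), is

$e_{s \rightarrow t} = x_t - x_s,$

so that each bone vector points from the source joint to the target joint, consistent with the 2s-AGCN definition.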

The motion information of a joint is obtained by calculating the coordinate difference of the same joint between adjacent frames, as shown in equation (3), where t2 represents the frame following t1. The resulting empty frame at the end of the sequence is filled with zeros, which requires less computation and is simpler than interpolation-based frame alignment. Because the time interval between two adjacent frames is fixed, the motion information indicates not only the change in joint position but also the speed of the joint motion.
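
A plausible form of equation (3), again with illustrative notation, is

$m_{a,t_1} = x_{a,t_2} - x_{a,t_1},$

where $t_2$ is the frame immediately following $t_1$; the motion of the last frame is set to zero.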

The two-stream or multistream networks used in some studies [8, 14, 18, 20, 21, 23, 30] have achieved good performance, but they also increase the number of model parameters. Therefore, this article embeds the bones and the relative positions and motions of joints into a high-dimensional space at the point level and then fuses these three features without multiplying the parameters. This data fusion can be expressed as follows:

$f_{\text{fused}} = \text{embed}(P) + \text{embed}(B) + \text{embed}(M). \qquad (4)$

In equation (4), P, B, and M denote the relative positions of joints, the bones, and the motion information of joints described in equations (1), (2), and (3), respectively, and embed(·) represents the embedding operation, which is composed of two convolution operations with kernels of size $1 \times 1$, similar to a dense layer; the operation realized by each such layer is

$f_{\text{out}} = \max(W f_{\text{in}} + b,\ 0), \qquad (5)$

where $f_{\text{in}}$ is the output of the upper layer, $f_{\text{out}}$ is the output of the current layer, $W$ is the weight, $b$ is the bias, and max is the rectified linear unit (ReLU) activation function.
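
As a concrete illustration of this point-level fusion (a minimal sketch only; the module names and the channel size of 64 are our assumptions and are not taken from the authors' released code), the embedding and addition described by equations (4) and (5) could be implemented as follows:

```python
import torch
import torch.nn as nn

class Embed(nn.Module):
    """Two 1x1 convolutions with ReLU, acting like per-joint dense layers (equation (5))."""
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):          # x: (N, C, T, V)
        return self.net(x)

class EarlyFusion(nn.Module):
    """Embed relative positions, bones, and joint motions, then sum them (equation (4))."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.embed_p = Embed(3, out_channels)
        self.embed_b = Embed(3, out_channels)
        self.embed_m = Embed(3, out_channels)

    def forward(self, p, b, m):    # each input: (N, 3, T, V)
        return self.embed_p(p) + self.embed_b(b) + self.embed_m(m)
```

Because the three embedded features are summed rather than stacked as separate streams, the downstream layers do not need to be replicated, which is what keeps the parameter count low.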

The relative positions of joints represent first-order information, whereas the bones and joint motions represent second-order information. The early feature-fusion method described above combines the advantages of these three features at different levels to obtain distinctive features and avoids the use of multistream networks. The data-processing and point-level details of our hierarchical model are shown in Figure 1. The original skeleton is directly input into the model, and the three features are generated at the data-processing level and input into the point-level feature-extraction layer. After being embedded separately, these features are summed. Additional data-processing layer details are provided in Section 3.3.

3.2. Spatial Level: Frame-Level Graph Convolution and Adjacent Graph Convolution Methods

In traditional skeleton-based human action recognition methods, the skeleton is treated as structured data similar to an image, and the spatial relationships between joints are ignored. The ST-GCN introduced a graph convolutional neural network and defined a spatiotemporal skeleton graph composed of nodes and edges, where nodes refer to the joints in the skeleton and edges are divided into two categories. Within the same frame, the connections between human joints are considered the first edge type, representing spatial information; these connections are described by an adjacency matrix. For the same joints, the connections between adjacent frames are considered the second edge type and are used to extract temporal information. The ST-GCN uses the adjacency matrix to perform graph convolution and extract spatial information. The graph convolution is realized using the following equation:

$f_{\text{out}} = \sum_{k=1}^{K_v} W_k \left( f_{\text{in}} \left( A_k \odot M_k \right) \right), \qquad (6)$

where $f_{\text{in}}$ is the input of a given spatiotemporal graph convolutional layer, $f_{\text{out}}$ is the output of the corresponding layer, $W_k$ stands for the weight, $A_k$ is the adjacency matrix, $M_k$ is the attention mask, and $K_v$ is the subset category number. In the ST-GCN, three connection mode subsets are identified: self-connection, centripetal connection, and centrifugal connection. The adjacency matrix is determined by the connections between joints: the positions corresponding to connected joints in the skeleton are set to 1, and those corresponding to unconnected joints are set to 0. The spatial connections between joints are modelled through multiplication with the adjacency matrix, and the temporal connections are realized by a convolution operation in the time dimension.

The ST-GCN directly multiplies $A_k$ and $M_k$ element by element. If some elements in $A_k$ have values of zero, the corresponding products are zero regardless of the values in $M_k$. This means that if a connection between two joints does not exist in the original skeleton, the network can never produce this connection. However, in some behaviours, two unconnected joints have notable relationships. For example, during actions such as “drinking water” and “eating,” strong correlations exist between the hands and the head; yet the hands and head are not directly connected, so it is difficult for the network to capture this correlation. In view of this limitation of the ST-GCN, the 2s-AGCN adds an unconstrained, parameterized adjacency matrix ($B_k$) and an independently computed graph matrix ($C_k$) for each sample, enhancing the flexibility of the model. Their method is shown in the following equation:

$f_{\text{out}} = \sum_{k=1}^{K_v} W_k f_{\text{in}} \left( A_k + B_k + C_k \right). \qquad (7)$

Compared with equation (6), equation (7) adds the sample-level adaptive terms $B_k$ and $C_k$; the other parameters are the same as those in equation (6). However, the addition of the adjacency matrix $A_k$, the parameterized matrix $B_k$, and the sample-level matrix $C_k$ still leads to a loss of spatial information: the 2s-AGCN does not consider the variation of the graph among the different frames of each sample. In fact, during each action, different frames exhibit different graph characteristics. Therefore, we use a frame-level adaptive graph convolutional layer at the spatial level of the model to capture the spatial features. Each frame-level adaptive graph convolutional layer includes two branches: the frame-level adaptive graph convolution branch and the adjacency graph convolution branch, in which predefined graphs are used. The whole calculation mechanism of the FLAGC layer is as follows:

$f_{\text{out}} = W_G f_{\text{in}} G + \sum_{k=1}^{K_v} W_k f_{\text{in}} \tilde{A}_k. \qquad (8)$

The first half of equation (8) is the frame-level graph convolution part, where $W_G$ is the weight of the graph convolution and $G$ is the frame-level graph of the action sequence. $G$ is similar to the $C_k$ term in the 2s-AGCN and uses a classical Gaussian embedding function to capture the similarity between joints. In contrast to the 2s-AGCN, we preserve the graph information of each frame and call the result the frame-level graph matrix. The calculation method is shown in the following equation:

$G = \operatorname{softmax}\!\left( \left( W_\theta f_{\text{in}} \right)^{\mathrm{T}_1} \left( W_\phi f_{\text{in}} \right)^{\mathrm{T}_2} \right), \qquad (9)$

where $f_{\text{in}}$ is the input matrix with the shape $C \times T \times V$ and $W_\theta$ and $W_\phi$ are the weights of two embedding layers. The embedding operation used here is the same as that used at the point level and consists of two convolutional layers with a convolution kernel size of $1 \times 1$. $\mathrm{T}_1$ and $\mathrm{T}_2$ represent two different transposes, which rearrange the embedded features into shapes of $T \times V \times C_e$ and $T \times C_e \times V$, respectively. The obtained $G$ is a frame-level similarity graph matrix with a scale of $T \times V \times V$. The frame-level graph does not use prior information but adaptively learns the corresponding graph structure at each frame of each sample and extracts the spatial features of each frame of the skeleton.
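
A minimal PyTorch sketch of this frame-level graph computation (our own illustration of equation (9); the channel sizes and the softmax normalization are assumptions, not taken from the authors' code):

```python
import torch
import torch.nn as nn

class FrameLevelGraph(nn.Module):
    """Compute a T x V x V similarity graph per frame via embedded Gaussian (equation (9))."""
    def __init__(self, in_channels=64, embed_channels=16):
        super().__init__()
        # Each embedding is two 1x1 convolutions, as at the point level.
        self.theta = nn.Sequential(
            nn.Conv2d(in_channels, embed_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(embed_channels, embed_channels, 1),
        )
        self.phi = nn.Sequential(
            nn.Conv2d(in_channels, embed_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(embed_channels, embed_channels, 1),
        )

    def forward(self, x):                       # x: (N, C, T, V)
        a = self.theta(x).permute(0, 2, 3, 1)   # (N, T, V, Ce) -- "transpose 1"
        b = self.phi(x).permute(0, 2, 1, 3)     # (N, T, Ce, V) -- "transpose 2"
        g = torch.matmul(a, b)                  # (N, T, V, V): one graph per frame
        return torch.softmax(g, dim=-1)         # normalize joint similarities per row
```

The resulting per-frame graph can then be multiplied with the input features of the corresponding frame in the frame-level branch.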

The second half of equation (8) is the adjacency graph convolution module. The adjacency matrix $A_k$ in the module consists of three matrices ($K_v = 3$). The first matrix represents the relationships between joints with distances of zero, i.e., the self-connections of joints. The second matrix represents the correlations between joints with distances of one, and the third matrix represents the correlations between joints with distances of two. Thus, adjacent features at different distances are extracted from these three matrices. The values of connected positions within each matrix are 1, and the values of the nonconnected positions are 0. The normalization process is shown in the following equation:

$\tilde{A}_k = \Lambda_k^{-\frac{1}{2}} A_k \Lambda_k^{-\frac{1}{2}}, \qquad (10)$

where $A_k$ is the adjacency matrix defined according to the bone connections and $\Lambda_k$ is the degree matrix used to normalize $A_k$. The adjacency graph matrix adds the known skeleton structure to the spatial level, making full use of prior information to further help the spatial level extract more spatial features.
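
For illustration, a sketch of how the normalized adjacency matrices could be built (assuming equation (10) is the standard symmetric normalization; the helper name and the toy four-joint chain are ours):

```python
import numpy as np

def normalize_adjacency(a, eps=1e-6):
    """Symmetrically normalize an adjacency matrix: A_tilde = D^(-1/2) A D^(-1/2)."""
    degree = a.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(degree + eps))
    return d_inv_sqrt @ a @ d_inv_sqrt

# Toy example with 4 joints connected in a chain (0-1-2-3).
num_joints = 4
edges = [(0, 1), (1, 2), (2, 3)]

a0 = np.eye(num_joints)                   # distance 0: self-connections
a1 = np.zeros((num_joints, num_joints))   # distance 1: direct neighbours
for i, j in edges:
    a1[i, j] = a1[j, i] = 1.0
a2 = np.clip(a1 @ a1, 0, 1) - a0          # distance 2: two-hop neighbours
a2 = np.clip(a2, 0, 1)

adjacency = [normalize_adjacency(a) for a in (a0, a1, a2)]   # K_v = 3 matrices
```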

Figure 2 shows the overall architecture of the FLAGC layer, and the calculation details are shown in Figure 3. The upper branch of Figure 2, corresponding to the left half of Figure 3, shows the frame-level graph convolution module. The module calculates the corresponding frame-level graph matrix from the input samples and then performs a graph-convolution operation. The lower branch of Figure 2, corresponding to the right half of Figure 3, shows the adjacency graph convolution module, which performs the corresponding graph-convolution operation with the input information using the predefined adjacency matrix . The FLAGC layer performs these two kinds of graph convolutions in parallel mode, making full use of the information contained in the samples and of the prior information.

3.3. Hierarchical Model: A Simple and Accessible Lightweight Model

Our proposed hierarchical model consists of four main parts, the data-processing level, point level, spatial level, and temporal level, as shown by the dotted boxes in Figure 4. The six core layers described above are marked with the Roman numerals I–VI: the coordinate embedding layer lies at the point level, the three FLAGC layers at the spatial level, and the two CNN layers at the temporal level.

In the data-processing layer, the skeleton is randomly rotated about the vertical axis. In practice, observed actions may not be collected with the subject completely facing the camera, and randomly rotating the skeleton is equivalent to increasing the amount of data sourced from different perspectives, thereby augmenting the data [11, 26, 41, 43]. This random rotation is performed according to the following equation:

$\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix},$

where $\theta$ stands for the random rotation angle, $(x, y, z)$ stands for the original coordinates, and $(x', y', z')$ stands for the rotated coordinates. Subsequently, the three features are calculated by equations (1), (2), and (3). Integrating the feature-generation method into the model reduces the workload of early data preparation and turns the model into an end-to-end recognition system.
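
A short sketch of this augmentation step (assuming, as in the equation above, that the second coordinate is the vertical axis; the function name is ours):

```python
import numpy as np

def random_vertical_rotation(skeleton, max_angle_deg=30.0):
    """Rotate all joints of a skeleton sequence about the vertical (y) axis by a random angle."""
    theta = np.deg2rad(np.random.uniform(-max_angle_deg, max_angle_deg))
    rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                    [0.0,           1.0, 0.0          ],
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    # skeleton: (T, V, 3) array of 3D joint coordinates
    return skeleton @ rot.T
```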

In the spatial layer, the three-feature-fused data are taken as the inputs, the spatial information of the skeleton is extracted using three FLAGC layers, and the spatial dimensions of output data are converted into one dimension by a pooling layer to complete the spatial feature extraction.

At the temporal level, the temporal features of the skeleton sequence are extracted by two consecutive convolutional layers whose kernels cover only the temporal dimension (a kernel of size $k \times 1$). Such a convolution is similar to a one-dimensional temporal convolution, in which the information is convolved only along the temporal dimension; it can be expressed by the following equation:

$f_{\text{out}} = K \ast f_{\text{in}} + b,$

where $f_{\text{in}}$ is the input from the upper layer, $K$ is the convolution kernel, $\ast$ denotes convolution along the temporal dimension, and $b$ is the bias. Each CNN layer is followed by BN and ReLU layers. After the two-layer convolution, AdaptiveMaxPool2d pools the temporal dimension of the output data into one dimension to complete the spatial and temporal information extraction, and a two-layer dense network completes the final action classification. The model hierarchically extracts action sequence features of different dimensions in a process that is clear and easy to understand. Compared with the ST-GCN [19], which adopts spatiotemporal layers to simultaneously extract spatial and temporal features, the method proposed herein simplifies the model structure, the number of layers, and the computational costs.
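
A minimal PyTorch sketch of such a temporal module (the kernel length, channel sizes, and class count below are placeholders we chose for illustration, not values from the paper):

```python
import torch
import torch.nn as nn

class TemporalModule(nn.Module):
    """Two temporal convolutions (kernel k x 1), then pooling and a two-layer classifier."""
    def __init__(self, in_channels=128, hidden=256, num_classes=60, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=(k, 1), padding=(k // 2, 0)),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=(k, 1), padding=(k // 2, 0)),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveMaxPool2d(1)          # collapse the remaining T (and V) dims
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):        # x: (N, C, T, V); V may already be 1 after spatial pooling
        x = self.conv(x)
        x = self.pool(x).flatten(1)
        return self.classifier(x)
```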

4. Experiment

4.1. Data Sets

The NTU RGB + D 60 data set [6] is one of the earliest large-scale, multimodal human action recognition data sets and was created in 2016. It contains RGB videos, depth information, skeleton information (the three-dimensional positions of 25 main joints), and infrared data. The 60 actions contained within the data set were collected from 40 subjects, with three Kinect v2 cameras placed in 17 different shooting setups. This data set addresses the problems associated with single viewing angles, limited action categories, and unchanging backgrounds that often arise in human action recognition studies based on deep learning. The data set provides two evaluation criteria: cross-subject (CS) and cross-view (CV). The CS task includes 40,320 training samples and 16,560 test samples, obtained by dividing the 40 subjects into two groups. For the CV task, the video samples collected by cameras 2 and 3 (37,920 samples) form the training set, and the videos collected by camera 1 (18,960 samples) form the test set.

The NTU RGB + D 120 data set [7] is an extended version of NTU RGB + D 60 and was created in 2019; an additional 60 actions and 57,600 samples are included in this extended data set. The resulting data set contains videos from 155 different camera viewpoints. A total of 106 subjects of different ages (10 to 57 years old) and cultural backgrounds (15 countries), recorded in 96 different environments (different backgrounds or lighting conditions), are included in the data set. It comprises 114,480 samples in total; the actions are mainly divided into three groups, daily, medical, and interactive human behaviours, covering the most common behaviours in human life. In the cross-subject evaluation, the 106 subjects are divided into two groups of 53, used for training (63,026 samples) and testing (50,919 samples), respectively. Among the 32 different Kinect setups, the odd-numbered setups are used for training (54,468 samples), and the rest are used for testing (59,477 samples) in the cross-view evaluation. Notably, 535 missing samples are ignored.

4.2. Experimental Details

To align the frame numbers of the action samples, we counted the frame numbers of the NTU RGB + D 120 samples and found that most of the samples were contained within 100 frames, except the “reading,” “writing,” “wearing a jacket,” and “taking off a jacket” samples; the proportion of samples within 100 frames was close to 70%. The distribution of sample frame numbers is shown in Figure 5, where the horizontal axis represents the frame ranges and the vertical axis represents the number of samples falling into each range. Understandably, many common actions in daily life are completed within 3 seconds. Therefore, our data-processing layer samples only nonzero frames and randomly and evenly selects 20 frames as the training input, as in the study by Zhao et al. [43]; the sampling interval is determined by the following equation:

$\text{interval} = \left\lfloor \frac{N}{T_s} \right\rfloor,$

where $N$ represents the original total number of frames and $T_s$ represents the standard number of frames to be aligned. After the sampling interval is obtained, we randomly select one frame within each interval. For example, if the total number of data frames is 100, one frame is extracted every 5 frames. Such random uniform extraction faithfully reflects the behaviours contained within the samples and makes the training process easy and fast. In addition, this random extraction method ensures that the training samples are not completely identical in each round and reduces overfitting to the training sets.
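
A small sketch of this sampling strategy (our own illustration; the function name and the handling of sequences shorter than 20 frames are assumptions):

```python
import numpy as np

def sample_frames(num_frames, num_samples=20):
    """Randomly pick one frame from each of num_samples evenly spaced intervals."""
    if num_frames <= num_samples:
        # Short sequence: keep all frames and pad by repeating the last index.
        idx = np.arange(num_frames)
        return np.concatenate([idx, np.full(num_samples - num_frames, num_frames - 1)])
    interval = num_frames // num_samples
    starts = np.arange(num_samples) * interval
    offsets = np.random.randint(0, interval, size=num_samples)
    return starts + offsets

# Example: a 100-frame sequence yields one random frame in every block of 5 frames.
indices = sample_frames(100)
```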

In addition, considering the diversity of camera perspectives in the data sets and in practical applications, skeletons are randomly rotated within [−30°, 30°] at the data-processing level to augment the data and adapt to changes in perspective.

We implement this model with PyTorch and train it on a Titan graphics processing unit (GPU). The Adam optimizer is used for optimization, and the weight decay is set to 0.0001. The initial learning rate is set to 0.001 and is decreased by a factor of 0.1 at the 60th, 80th, and 100th epochs. The maximum number of training epochs is set to 100. The batch size for both data sets is 64. The number of channels in each layer is shown in Figure 6; the embedding layer consists of two convolutional layers, and the size of each cube indicates the size of the data.
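
These hyperparameters correspond to a training setup along the following lines (a sketch only; the placeholder `model` and the choice of cross-entropy loss are our assumptions):

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

# `model` stands for the FLAGCN network; its construction is omitted here.
model = nn.Linear(10, 60)  # placeholder module so the snippet runs

optimizer = Adam(model.parameters(), lr=0.001, weight_decay=0.0001)
scheduler = MultiStepLR(optimizer, milestones=[60, 80, 100], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):          # maximum of 100 training epochs
    # ... iterate over mini-batches of size 64, compute loss, backpropagate ...
    scheduler.step()
```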

4.3. Ablation Study
4.3.1. Fusion Modes of Different Features

With regard to the early fusion of the three features (bones and the relative positions and motions of joints), we test various combinations, including individual features, feature concatenation, and feature addition. Finally, we confirm that the feature-fusion method proposed in Section 3.1 yields the best action recognition accuracy.

In Table 1, P stands for the relative position, B stands for the bone vector, M stands for the motion information, “+” stands for the addition operation, and “cat” stands for concatenation along the point dimension. We also tried “cat” to minimize the number of parameters and finally determined that adding the three features after embedding gives the best accuracy. Table 1 shows that using multiple inputs improves the accuracy by 1%–5%.

4.4. Effectiveness of the Frame-Level Graph Convolution

We make many attempts to explore the validity of the two kinds of graph convolution structures in the frame-level adaptive graph convolutional layers. First, for the frame-level graph convolution structure, we test two methods that calculate a global graph for each sample instead of the frame-level graph, namely, the maximum and the average of the graph matrix values along the temporal dimension. The results are shown in Table 2.

From Table 2, we can see that the accuracy obtained using the mean value is slightly higher than that obtained using the maximum value. The frame-level graph convolution method achieves the best performance without requiring additional parameters or computational costs (floating-point operations (FLOPs)). The parameter and FLOP units are $10^6$ and $10^9$, respectively.

To confirm that the two kinds of parallel graph convolution play a positive role in the overall model, we place the two graph convolution structures into the spatial level separately and in superposition, as is done in the 2s-AGCN. The results are shown in Table 3, which indicates that the parallel structure of the two convolution operations improves the accuracy by more than 1%.

4.5. Effectiveness of the Hierarchical Model Structure

To further explore the effectiveness of the layered feature extraction in the studied model, similarly to the methods used in previous studies [19, 20], we add a temporal feature-extraction layer to each spatial-extraction layer to form a layer similar to a spatiotemporal graph convolution layer. The results are shown in Table 4. Despite including more parameters, the spatiotemporal graph convolution variant performs poorly on the cross-subject task, and its accuracy is 1.6% lower than that of the hierarchical model. To better extract temporal features, we also try to use LSTM and gated recurrent unit (GRU) modules to construct the temporal module. Similarly, despite using more parameters, the accuracies of both methods are reduced by more than 2%. The third row in Table 4 shows the accuracy obtained when the data-processing layer does not perform random rotation; because random rotation simulates the characteristics of viewing-angle changes, this configuration performs poorly on the cross-view recognition task.

4.6. Comparison with the SOTA Method

Table 5 shows a performance comparison between our model and other excellent methods based on different networks on the NTU RGB + D 60 data set. Our model achieves accuracies of 89.4% on CS and 94.8% on CV, which are 0.4% and 0.3% higher, respectively, than those of a previously established method [26].

Because the NTU RGB + D 120 data set is new, some established methods have not been tested on this data set, and many methods do not report their numbers of parameters, computational costs, and prediction speeds. We obtained some of the data in Table 6 through our own tests, and these entries are marked accordingly. Table 6 compares the accuracies, parameters, FLOPs, and prediction speeds of these models; the parameters, FLOPs, and prediction speeds are all based on the NTU RGB + D 60 data set. The prediction speeds listed in Table 6 represent the average time required for the trained model to predict a given action sequence. The results show that the FLAGCN achieves accuracies of 81.6% on the CS task and 82.9% on the CV task; these accuracies are slightly higher than those of the 2s-AGCN. At the same time, the number of required parameters is reduced to less than 1/8, the computational cost is reduced to less than 1/9, and the prediction speed is seven times faster than that of the 2s-AGCN. In Table 6, SGN has the smallest number of parameters and the fastest speed, but its accuracy is slightly lower than those of the other models. The Sybio-GNN achieves the highest accuracy, but it requires many parameters. ResGCN-N51 requires fewer parameters than our method and achieves higher accuracy on the NTU RGB + D 120 data set, but its accuracy on the NTU RGB + D 60 data set is lower than that of our method. In contrast to our model, the ResGCN-N51 model uses a parallel extraction structure to obtain spatiotemporal features.

4.7. Visualization of the FLAGCN

To further confirm that the FLAGCN can model the spatial structures embodied in human actions and to explain the model by displaying the features extracted from each of its layers, we provide two visualisations: the outputs of the three FLAGC layers and a frame-level graph matrix. To make the skeleton appear clearer, we display it in 2D rather than 3D, so some occlusion is present.

First, to confirm that spatial features are gradually extracted across the three FLAGC layers, we show the output of each layer at the spatial level. We obtain joint-wise weights by averaging over all dimensions except the joint dimension. These weights are normalized and magnified 100 times to determine the displayed joint sizes. We choose two representative hand and foot movements as examples, as shown in Figure 7: panel (a) represents the “wipe face” movement, whereas panel (b) displays the “kicking something” movement. The three subgraphs of each action are the outputs of the first, second, and third FLAGC layers. The weights of the joints in the output of the first layer do not differ greatly, but the weights of the hands and feet increase in the outputs of the second and third layers.

In addition, we visualize the frame-level graph produced by our frame-level graph convolution, showing that the weight of each frame in a given action is different and that the FLAGC layer captures this difference. After the previous layers, the frame-level graph matrix of the third FLAGC layer supports our viewpoint most clearly, so we choose it for visualisation. In Figure 8, panel (a) displays the “headache” movement and panel (b) shows the “cross hands in front” movement. We choose the second, tenth, and nineteenth frames of these two action sequences to represent the early, middle, and late stages of each action. The graph parameters calculated by the network are normalized and magnified 100 times to determine the sizes of the corresponding joints. The weights of the joints do not differ greatly in the early stage, but in panel (a), the weights of the hands and head increase in the middle and late stages. Panel (b) shows that the head and elbows carry more weight in the early stage; in the middle stage, the weight of the arms increases and that of the head decreases; in the late stage, the weight of the middle of the spine begins to grow as the arms reach its vicinity. This is consistent with our perspective that the structure of the graph may change continuously during an action. At the same time, the observed changes in key joints at different moments of the movements also align with common sense. Because the visualisation results of the temporal level have no practical significance, they are not displayed.

5. Conclusion

This study proposed a lightweight hierarchical model with early feature fusion and frame-level adaptive graph convolution, which can be applied to resource-constrained embedded devices. Our model uses a 6-layer network architecture instead of the traditional 9-layer architecture, reducing the number of required parameters and the computational costs and providing a simple method for real-time skeleton-based human action recognition. In this model, the early feature-fusion process integrates the advantages of multiple features without a multistream network. The FLAGC layer is divided into two branches to capture spatial information: the frame-level graph convolution branch calculates the graph structure of each frame, whereas the adjacency graph convolution branch extracts the relationships between adjacent nodes using the adjacency characteristics of the corresponding joints; the combination of these two graph convolution methods allows spatial information to be extracted from action sequences more comprehensively. The network is designed as a hierarchical, effective, end-to-end model, and extensive experiments were conducted to evaluate it. The final model is verified on the NTU RGB + D 60 and 120 data sets and uses only 1/8 of the parameters and 1/9 of the FLOPs required by the 2s-AGCN while achieving a higher recognition accuracy and faster prediction speed.

Next, we hope to deploy the model in our own embedded system and apply it in a variety of scenarios, such as displays, performances, and games. We will focus on the applicability of the model under different hardware conditions with limited storage and computing performance.

Data Availability

The data sets used in this paper are public, free, and available at https://rose1.ntu.edu.sg/dataset/actionRecognition/.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by Funds for Key Laboratory of Ministry of Culture and Tourism (WLBSYS2005) and the Fundamental Research Funds for the Central Universities (CUC19ZD005).