Abstract

With the broadening of application scenarios for the Internet of Things, intelligent behavior recognition has attracted increasing attention. Since human behavior is nonrigid motion with strong spatiotemporal topological association, modeling it directly with traditional Euclidean space-based methods may destroy its underlying nonlinearity. Building on the advantages of Riemannian manifolds in describing 3D motion, we propose an end-to-end 3D behavior manifold feature learning framework composed of deep heterogeneous networks. This heterogeneous architecture leverages graph construction to guide the manifold backbone network to mine more discriminative nonlinear spatiotemporal features. First, we model the nonlinear spatiotemporal co-occurrence of 3D behavior in a high-dimensional Riemannian manifold space. Second, we implement a non-Euclidean heterogeneous architecture on the Riemannian manifold so that the backbone network can learn deep spatiotemporal features while preserving the manifold topology. Finally, we introduce an end-to-end deep graph similarity-guided learning optimization mechanism that enables the overall model to fully exploit the complex similarity relationships between manifold features. We verify our 3D deep heterogeneous manifold network on popular skeleton behavior datasets and achieve competitive results.

1. Introduction

Behavior recognition tasks [13] have received much attention owing to the rapid development of artificial intelligence and computer vision. Behavior recognition plays an increasingly important role in smart security, human-computer interaction, and immersive games: it enables warnings of dangerous behavior, more convenient behavior instructions for human-computer interaction, and a richer immersive game experience. With the great improvement of computers and of devices for capturing human skeleton motion, skeleton sequence data have become easier to acquire, which promotes the development of skeleton-based behavior recognition [4, 5]. Skeleton-based methods eliminate the influence of the background and are invariant to the viewpoint, so they can focus on the behavior itself. For these reasons, more and more researchers are working on skeleton-based action recognition.

Existing behavior recognition methods fall into three main groups: methods based on spatial features of skeleton coordinates, methods based on temporal information of the skeleton sequence, and methods based on spatiotemporal features. Among the methods based on spatial features of skeleton coordinates, the covariance matrix of the joint position trajectories is calculated to build a temporal model of the skeleton sequence [2]. In [3], the pairwise relative positions of joints are used to describe the posture and joint changes of the skeleton sequence, and principal component analysis is applied to normalize the features and obtain a representation of the principal features. In [4], the rotations and translations between body parts are used as features, and the Fourier temporal pyramid (FTP) is utilized to model the temporal dynamics. These methods pay more attention to the spatial relationships of the joints in the skeleton behavior, which weakens the attention to temporal features to a certain extent.

For temporal information, Wang et al. [1] calculate the relative positions between each joint and the other joints to represent each frame of the skeleton sequence and then model the temporal information. In [6], a histogram of 3D joint positions is calculated to represent each frame of the skeleton sequence, and HMMs are used to model the temporal dynamics. Kim and Reiter [7] propose a temporal convolutional neural network (TCN) for 3D human behavior recognition. Compared with the popular LSTM-based recurrent neural network model, the TCN-based model is more intuitive and interpretable [7]. These methods take the temporal features of behavior into account, but they may ignore some globally related spatial features and cannot closely link temporal and spatial features.

Among the methods based on spatiotemporal features, Yan et al. [8] design a skeleton sequence graph containing temporal information and use a spatiotemporal graph convolutional network to learn the spatiotemporal features in behavior sequences. Ke et al. [9] use a deep convolutional neural network to obtain the temporal features of the skeleton sequence, use a multitask learning network to process all the frames of the generated fragments, and finally combine the spatial information for behavior recognition. Some scholars use a graph convolutional network (GCN) combined with an LSTM or a dual-stream network structure [5, 10–12] to extract spatiotemporal information from behavior sequences. These methods attend to the close relationship between temporal and spatial features, but they cannot accurately describe the spatiotemporal co-occurrence that behavior features also exhibit.

To learn more discriminative spatiotemporal manifold features with a deep model, we need to comprehensively consider the spatiotemporal co-occurrence relationships between both connected and disconnected skeleton parts. To this end, for each frame of the original nonrigid 3D skeleton behavior sequence, we represent the spatial structure based on the transformation group and use the Riemannian manifold to construct the relative spatial transformation relationships between all pairs of skeleton parts. This spatial structure representation describes the relative motion relationships between all pairs of skeleton parts in a frame as a point in the high-dimensional Riemannian manifold space.

Since each action sequence consists of many frames, we employ an interpolation method based on the transformation group to integrate the points on the manifold into a transformation group curve, so as to model the co-occurrence relationship of the spatiotemporal features of the original 3D skeleton sequence. However, directly feeding features with manifold constraints into a neural network incurs high time and space complexity, and it is currently difficult for a neural network to mine the rich information contained in manifold input while preserving the manifold constraints. Wang et al. [13] propose a GCN-based method to solve the problem of edge prediction between nodes. Inspired by this method, we treat each action as a node, construct a similarity graph of all nodes based on their manifold trajectories, use graph convolution to predict connections, and finally classify the behaviors. The difficulty to be solved in this idea is how to construct a graph of feature nodes in manifold space.

Graph construction is commonly used to determine the similarity of members in social network analysis [14, 15], and the constructed graph is then used for intelligent recommendation. In these applications, the multidimensional features of the task are usually data in Euclidean space, and existing methods such as KNN [16] can solve this problem. In our application scenario, however, we need to construct a graph over behavior feature nodes in manifold space. Therefore, in this study, we propose a graph construction method based on the Riemannian metric on the manifold. This method takes full advantage of the rich information of data on the manifold. At the same time, the Riemannian metric method can map behavior nodes isometrically into the projected space.

This study proposes a 3D behavior recognition method based on spatiotemporal trajectory graph construction; its framework is shown in Figure 1. The method uses the Riemannian metric to measure the properties of spatiotemporal trajectories, which brings similar nodes closer and pushes dissimilar nodes, or nodes of different types, far apart. The model has four main stages: data preprocessing, Riemannian metric graph construction, graph convolution, and behavior classification. In the data preprocessing stage, we process the 3D coordinate data of the skeleton sequences into a behavior trajectory curve representing the relative behavior relationship between any pair of bones. To express as much spatial information as possible and reflect rich spatiotemporal co-occurrence, we calculate the relative behavior relationship between any two bones. In the Riemannian metric graph construction stage, we roll and unfold the processed manifold spatiotemporal trajectory curve along the direction of the trajectory into a corresponding continuous rolling tangent space curve. This process tries to ensure that the distance between any two points on a tangent space curve is equivalent to the distance between the corresponding two points on the original manifold. We then use DTW to measure the similarity between curves and use the similarity between behavior nodes to construct a similarity graph. In the graph convolution stage, through the updates in each iteration of graph convolution, similar nodes are pulled closer and different nodes are pushed apart so that behavior nodes of the same category are gathered together. Finally, in the classification stage, the labels are spread from the central point of each cluster to classify the behaviors.
The main contributions of this study are as follows:
(1) For skeleton sequences, we extract rotation and translation relationships from bone pairs and represent them as discrete trajectories on a Riemannian manifold, which can describe spatiotemporal co-occurrence and global relative relationships.
(2) We propose a graph construction method based on continuous projections on a Riemannian manifold, which maps the spatiotemporal trajectories on the manifold isometrically to preserve the more complex similarity distribution relationships between manifold features.
(3) We propose a deep heterogeneous manifold model consisting of two subnetworks with different structures. It incorporates an end-to-end optimizable manifold backbone network, which exploits the powerful representative ability of the Riemannian manifold and can be guided by the subsequent graph-based subnetwork.

2. Spatiotemporal Manifold Trajectory Representation

To fully exploit the nonlinearity of behavior data, we represent them as curves in manifold space. Specifically, we represent them on the Lie group manifold in the form of a Cartesian product, which can capture rich spatiotemporal co-occurrence relationships.

Given the 3D coordinates of the joints of a skeleton behavior sequence, we assume that the number of frames of a behavior sequence is , and the number of joints is , so the coordinate of the th joint in the frame is expressed as , and the 3D coordinates of a behavior sequence are represented as . These 3D coordinates come with the body structure data given in the dataset, i.e., the joints above are connected according to the body structure. Without loss of generality, suppose that the joint and the joint are the two ends of the bone in the first frame, so this bone can be represented as ; in this way, a bone can intuitively be represented as a vector in 3D Euclidean space, and the set of bones can also be obtained. Since the spatiotemporal graphs of the body structure in current skeleton data are all acyclic, the number of bones is . In the body skeleton, the relationships between any two different bones amount to pairs.
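The relative relationship between two bones can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: a bone is a pair of 3D joint coordinates, the relative rotation is the one that aligns the direction of the first bone with that of the second (Rodrigues' formula), and the translation is simply the offset between the two bone start joints. The function names `rotation_between` and `relative_transform` are hypothetical, and the exact local coordinate convention used in this study may differ.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _unit(a):
    n = math.sqrt(_dot(a, a))
    return [x / n for x in a]

def rotation_between(u, v):
    """3x3 rotation that maps the direction of u onto the direction of v."""
    u, v = _unit(u), _unit(v)
    w = _cross(u, v)          # rotation axis scaled by sin(angle)
    s2 = _dot(w, w)           # sin^2(angle)
    c = _dot(u, v)            # cos(angle)
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if s2 < 1e-24:
        if c > 0:
            return eye        # same direction: identity rotation
        # Opposite direction: rotate by pi about any axis perpendicular to u.
        helper = [1.0, 0.0, 0.0] if abs(u[0]) < 0.9 else [0.0, 1.0, 0.0]
        a = _unit(_cross(u, helper))
        return [[2 * a[i] * a[j] - eye[i][j] for j in range(3)] for i in range(3)]
    k = [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    f = (1.0 - c) / s2
    return [[eye[i][j] + k[i][j] + f * k2[i][j] for j in range(3)]
            for i in range(3)]

def relative_transform(bone_a, bone_b):
    """4x4 homogeneous matrix [R d; 0 1] relating bone_a to bone_b.

    Each bone is (start_joint, end_joint). The translation d is the offset
    between the start joints, which is a simplifying assumption.
    """
    r = rotation_between(_sub(bone_a[1], bone_a[0]), _sub(bone_b[1], bone_b[0]))
    d = _sub(bone_b[0], bone_a[0])
    return [r[i] + [float(d[i])] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
```

Because the bone directions are normalized before the rotation is computed, the rotation block does not change when the skeleton is rescaled. Evaluating `relative_transform` for every pair of bones in a frame yields one matrix per pair, and stacking these over frames gives a discrete trajectory.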

The elements in the trajectory manifold have the following constraints:

where SE(3) is the special Euclidean group and SO(3) is the special orthogonal group.

The manifold trajectory using relative relationships has the following advantages:
(1) The features used to represent the rotation relationship between bones are scale invariant; in other words, no matter at what scale the skeleton is represented, the rotation relationship between the bones is unchanged.
(2) The relative relationship has spatial co-occurrence, i.e., we can explore the relationship between any two bones, not only between spatially connected bone pairs.
(3) Representing the relative relationship of the skeleton as a trajectory curve closely combines the spatial and temporal information: different spatial features are represented point by point to form a discrete curve in manifold space, which helps to increase the similarity of features with similar temporal information.

3. Backbone Network of Deep Heterogeneous Manifold Network

3.1. Riemannian Manifold Preservation Network

Since the input of our deep Riemannian manifold network is the initialized high-dimensional Riemannian manifold transformation group data, the richness and topology of its nonlinear structure must be maintained during feature learning. The commonly used Euclidean convolution layer may destroy this property, so we employ a convolution-like Riemannian transform layer whose transform parameters are optimized during deep model learning and whose output still conforms to the Riemannian manifold constraints, which preserves the Riemannian manifold topology of the data.

According to the above description, the feature is a set of points in the motion group , which is represented in the form of discrete curves on the Lie group manifold [17, 18]. We denote this manifold as , and the set of points is ; then, the feature of the th frame in the th behavior is represented as . Any point on the manifold is subject to constraints: if we have any , then and , where is the identity matrix, which is also the identity element on the manifold, and is the operation that computes the determinant. So, there is

If we have , then .

This property can be summarized as

The matrix has the invertible property . Therefore, the behavior trajectory curve is in the form of .
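The group properties referred to above can be checked numerically. The sketch below assumes the standard matrix form of the special Euclidean group SE(3), a 4x4 homogeneous matrix whose rotation block is orthogonal with determinant 1, and verifies that the product of two such matrices and the inverse of one both stay in the group. The helper names (`make_se3`, `is_se3`, `se3_inverse`) are hypothetical and are not taken from this study.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def det3(r):
    return (r[0][0] * (r[1][1] * r[2][2] - r[1][2] * r[2][1])
            - r[0][1] * (r[1][0] * r[2][2] - r[1][2] * r[2][0])
            + r[0][2] * (r[1][0] * r[2][1] - r[1][1] * r[2][0]))

def make_se3(axis, theta, d):
    """Rigid transform: rotation by theta about the x or z axis, translation d."""
    c, s = math.cos(theta), math.sin(theta)
    if axis == "x":
        r = [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
    elif axis == "z":
        r = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    else:
        raise ValueError("axis must be 'x' or 'z'")
    return [r[i] + [float(d[i])] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def is_se3(t, tol=1e-9):
    """Check R^T R = I, det(R) = 1, and the bottom row [0, 0, 0, 1]."""
    r = [row[:3] for row in t[:3]]
    rtr = matmul(transpose(r), r)
    orthogonal = all(abs(rtr[i][j] - (1.0 if i == j else 0.0)) < tol
                     for i in range(3) for j in range(3))
    bottom_ok = all(abs(x - y) < tol
                    for x, y in zip(t[3], [0.0, 0.0, 0.0, 1.0]))
    return orthogonal and abs(det3(r) - 1.0) < tol and bottom_ok

def se3_inverse(t):
    """Closed-form inverse [R^T, -R^T d; 0, 1]."""
    rt = transpose([row[:3] for row in t[:3]])
    d = [t[i][3] for i in range(3)]
    nd = [-sum(rt[i][k] * d[k] for k in range(3)) for i in range(3)]
    return [rt[i] + [nd[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
```

A layer that multiplies its input by a learned group element therefore produces output that still satisfies the manifold constraints, which is the property a Euclidean convolution does not guarantee.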

The initialized high-dimensional Riemannian manifold transformation group data are also a spatiotemporal co-occurrence representation of the original 3D data, and thus require spatial and temporal pooling on the Riemannian manifold. Pooling not only reduces the data dimension while preserving topology but also yields more discriminative spatiotemporal manifold features between action sequence frames.

3.2. Graph Construction Based on Manifold Trajectory

On the obtained manifold trajectory curves, we use the Riemannian similarity metric to construct a graph for the behavior features on the Riemannian manifold. Distance on a manifold is obtained by measuring geodesics on the manifold. To ensure that the distance between any two points on the manifold remains constant in the constructed graph, we map the points on the manifold isometrically to a space that is convenient for measurement. The graph construction method based on the Riemannian similarity metric is shown in Algorithm 1.

Input: trajectory curves of all skeletons ; behavior sequence label in training set ; total number of behavior categories ;
(1)for Given behavior category do
(2)  Calculate the average trajectory curve of each class on the manifold;
(3)  Average trajectory curve ;
(4)end for
(5)for all Training trajectory curve with label do
(6)  Continuously project training trajectory curve along the average trajectory curve ,
(7)  Obtain the curve features on the tangent space after continuous projection;
(8)end for;
(9)for all Test trajectory curve do
(10)  Given test set trajectory curve
(11)  for each behavior category do
(12)    Continuously project test set trajectory curve along the average trajectory curve ;
(13)  end for
(14)  Continuously unfold test set trajectory curve along the path of average curves, obtain a set of curves
(15)  Calculate the set of similarity scores between each curve in the curve set and the corresponding average curve ;
(16)  Obtain the features under the score reflecting to the highest similarity;
(17)end for;
(18)for all Training trajectory curve do
(19)  Given a curve feature, use DTW to calculate the most similar trajectory curve to this curve;
(20)  Get adjacency list ;
(21)end for;
(22)for all Test trajectory curve do
(23)  Given a curve feature, use DTW to calculate the most similar trajectory curve to this curve;
(24)  Get adjacency list ;
(25)end for;
Output: Curve features of training set and test set after continuous projection; The adjacency list obtained of the training set and test set ;
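The DTW-based adjacency step at the end of Algorithm 1 can be sketched as follows. This is a minimal illustration, assuming that each projected curve is a list of equal-dimensional feature vectors, that the local cost between two frames is the Euclidean distance, and that each node keeps its `k` nearest curves. The names `dtw_distance` and `knn_adjacency` are hypothetical, and the study may use a different local cost or neighbor rule.

```python
import math

def dtw_distance(curve_a, curve_b):
    """Classic dynamic time warping cost between two discrete curves."""
    inf = float("inf")
    n, m = len(curve_a), len(curve_b)
    # cost[i][j] is the best alignment cost of the first i and first j frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(curve_a[i - 1], curve_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat frame of b
                                     cost[i][j - 1],      # repeat frame of a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def knn_adjacency(curves, k):
    """Adjacency list: for each curve, the indices of its k most similar curves."""
    adjacency = []
    for i, curve_i in enumerate(curves):
        ranked = sorted((dtw_distance(curve_i, curve_j), j)
                        for j, curve_j in enumerate(curves) if j != i)
        adjacency.append([j for _, j in ranked[:k]])
    return adjacency
```

DTW tolerates differences in execution speed: two curves that trace the same path at different rates receive a small cost, so they end up as neighbors in the similarity graph.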

The dimension of the matrix is 6, which brings high computational and space complexity to operations such as multiplication and inversion. Therefore, in this study, we do not directly calculate the distance between two points on the manifold. Instead, we look for a method that isometrically maps the points on the manifold to a space that is convenient for measurement. If we directly unfold the projection at a single point, for example at the pole, then the closer a curve is to the pole, the more similar its projection is to the original curve on the manifold, and the farther away from the pole, the more distorted the projected curve becomes. Inspired by methods based on geodesic distance [19], we propose a method for measuring the distance between curves on the manifold based on a continuous projection.

Figure 2 shows a continuous projection of a behavior trajectory curve on the manifold along the quasi-average curve to its corresponding tangent space. In the curve on the manifold, we use the continuous projection method along the average curve of the class (i.e., the dotted line in the figure) to project the points on the curve one by one into the tangent space. The lengths of line segments , , and on the manifold are, respectively, equal to the lengths of , , and of the corresponding tangent space.

Below, we explain this continuous projection process in detail. Specifically, the continuous projection mapping on the manifold is a smooth mapping : along a smooth average curve :

In particular, this rolling continuous mapping needs to meet the three conditions defined in [20] at any time , namely, rolling conditions, no-slip conditions, and no-twist conditions. The continuous projection is a continuous map that satisfies the above three conditions and maps the manifold trajectory to the corresponding tangent space.

Since the area near a point on the Lie group manifold is smooth, any point in this area can be reached from its neighbors by a slight rotation and translation. Assume that is a point on the manifold space of and that is a curve on starting from when ; at any subsequent time, we can find a point on the curve corresponding to that time. If such a smooth curve can be found, it meets the continuous projection condition. Since our calculation cannot exhaust every point on the continuous curve, we project the points on the curve frame by frame in the following calculation. Under the three conditions described above, this mapping process can be expressed as

where denotes the semidirect product symbol and is the solution of the motion equation in the projection process at time .

This process is a continuous projection along the curve on the Lie group manifold ; the curve has the following expression:

is the expansion of the curve under the effect of continuous projection at :

Suppose we perform continuous projection in the time interval on a certain behavior curve. Since the curve on the manifold we use is discrete on the time axis, we get the corresponding points in the mapping space. It is .

Using the continuous projection method, the process of obtaining the similarity between the behavior curves from the manifold space is shown in Figure 3. We take the three points A, B, and C of a certain behavior curve on the manifold as an example. After continuous projection, they correspond to the three points a, b, and c in the tangent space. Our method aims to keep the distances AB, BC, and AC on the manifold approximately equal to the mapped distances ab, bc, and ac, and in particular to ensure that the distances between nodes of the same category are as similar as possible.

Projection onto the tangent space at a single point has a problem: the closer the data are to the projection point, the better the features and the local similarities between the data are retained. Conversely, the farther the data are from the projection point, the more their relative distances are stretched after projection, which destroys both the local similarity of the data far from the projection point and the global similarity of the whole data. Our continuous projection keeps the local and global similarity between nodes as much as possible, avoiding distortions of the distances between nodes that would affect the subsequent node classification.
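The distortion caused by projecting at a single point can be illustrated on a toy manifold. The sketch below uses the unit sphere instead of the Lie group manifold of this study, an assumption made purely for simplicity, and compares the geodesic distance between two nearby points with the distance between their images under the logarithm map at the north pole. The function names are hypothetical.

```python
import math

def sphere_point(colatitude, longitude):
    """Point on the unit sphere from spherical angles (radians)."""
    return (math.sin(colatitude) * math.cos(longitude),
            math.sin(colatitude) * math.sin(longitude),
            math.cos(colatitude))

def geodesic_distance(p, q):
    """Great-circle distance between two unit vectors."""
    c = sum(a * b for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, c)))

def log_map_at_pole(q):
    """Tangent-plane coordinates of q under the log map at the pole (0, 0, 1)."""
    theta = math.acos(max(-1.0, min(1.0, q[2])))  # geodesic distance to pole
    s = math.hypot(q[0], q[1])
    if s < 1e-12:
        return (0.0, 0.0)
    return (theta * q[0] / s, theta * q[1] / s)

def distortion(colatitude, delta_longitude=0.2):
    """Ratio of tangent-space distance to geodesic distance for two points
    on the same circle of latitude, separated by delta_longitude."""
    p = sphere_point(colatitude, 0.0)
    q = sphere_point(colatitude, delta_longitude)
    in_tangent = math.dist(log_map_at_pole(p), log_map_at_pole(q))
    return in_tangent / geodesic_distance(p, q)
```

Near the pole the ratio stays close to 1, while at the equator the same pair of points appears more than one and a half times farther apart in the tangent plane. Projecting point by point along a reference curve is intended to avoid this growth of the error with distance from a single projection point.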

Generally, the behavior curves of a certain type on the manifold do not completely coincide with a geodesic. When the continuous projection curve satisfies certain constraints, it degenerates into a geodesic curve. In the local projection at a given point, the curve on the manifold and its counterpart in the corresponding tangent space have the same geodesic curvature. That is to say, the geodesic curve is a projection curve that meets certain constraints, so its applicable range is narrow. Our continuous projection method can be applied to more manifold projection scenarios; unfolding the behavior curve along the average curve of a class can better measure the similarity between different classes.

4. End-to-End Optimizable Graph-Guided Heterogeneous Model

In previous deep learning-based 3D action recognition methods, most methods use a fully connected layer at the end of the backbone network and a cross-entropy loss to complete the task. During iterative learning, they do not fully consider the similarities and variations between deep features of similar actions, or the differences between deep features of different action categories. Since the output of our backbone network is still topology-preserving Riemannian manifold data, we need a method for constructing a nearest neighbor graph on a high-dimensional Riemannian manifold to model local similarities, combined with a graph convolutional network that performs deep global similarity prediction to guide the feature learning of the backbone network. This makes full use of the potential local similarity relationships in the local context of each action sequence, so that the whole heterogeneous network can integrate the common features of the same category and suppress their variations while enlarging the differences between categories through the aggregating capability of graph convolution.

Our deep heterogeneous manifold network consists of two subnetworks with different structures. The former is the backbone network for learning deep manifold spatiotemporal features, and the latter is the graph convolution-guided learning subnetwork, which is built on the trajectory curves described above. In the backbone network, two pooling learning submodules are added to learn more discriminative features that further benefit the graph convolutional network. In an end-to-end manner, the latter subnetwork can guide the feature learning of the backbone subnetwork. However, backpropagation becomes more complicated, and because the whole heterogeneous model is built on the Riemannian manifold, optimization becomes a problem with manifold constraints. If the manifold is embedded in a linear space, the dimension increases, and so does the complexity, which makes optimization in Euclidean space very difficult. On some specific Riemannian manifolds, however, the constraints can be eliminated so that the problem becomes unconstrained, so we consider solving the end-to-end optimization problem directly on the Riemannian manifold.

In the first module of the trajectory curve feature learning part, we set the learning parameters on a Lie group manifold and then perform spatial pooling on the data that have undergone manifold learning. This selects the more discriminative spatial features learned by the previous layer and reduces the computational complexity of the spatial features, which facilitates subsequent computation. Similarly, the second module sets learning parameters on the Lie group manifold and then performs temporal pooling on the data. This selects the more discriminative temporal features learned by the previous layer and reduces the computational complexity of the temporal features.

Given and , we suppose that the data passed in each time are . Due to the retention of Lie group operations, there is

Therefore, in this part, the learning of the network parameters is constrained to the Lie group manifold. In the graph-guided convolution module, we loop over all behavior nodes, put all nodes into a queue, construct a domain subgraph with each node as the central point, and predict the connection relationship between the included peripheral nodes and the central point. As a result, a set of edges weighted by connection probability is obtained. To cluster similar nodes together, a simple method is to prune all edges whose weights are lower than a certain threshold and use breadth-first search to propagate pseudolabels. In each iteration, edges below the current threshold are cut, and connected clusters larger than a predefined maximum size are kept in the queue for the next iteration, in which the threshold for cutting edges is increased. This loop is repeated until the queue is empty, at which point all nodes have been marked with pseudolabels of a category. We then propagate the label of the central node of each cluster, which realizes the classification of the nodes.
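The pruning and pseudolabel propagation loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: the edge weights are supplied directly as a dictionary rather than predicted by a graph convolutional network, and the parameters `threshold`, `max_size`, and `step` as well as the function names are hypothetical.

```python
from collections import deque

def connected_components(nodes, edges, threshold):
    """Breadth-first search over edges whose weight is at least threshold."""
    members = set(nodes)
    neighbors = {n: [] for n in nodes}
    for (u, v), weight in edges.items():
        if weight >= threshold and u in members and v in members:
            neighbors[u].append(v)
            neighbors[v].append(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for nb in neighbors[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(component)
    return components

def propagate_pseudolabels(nodes, edges, threshold, max_size, step):
    """Assign one pseudolabel per cluster, re-splitting oversized clusters
    with a stricter threshold until the queue is empty."""
    if step <= 0:
        raise ValueError("step must be positive")
    labels, next_label = {}, 0
    pending = deque([(list(nodes), threshold)])
    while pending:
        group, current = pending.popleft()
        for component in connected_components(group, edges, current):
            if len(component) > max_size and current + step <= 1.0:
                # Too large: revisit in a later iteration with a higher threshold.
                pending.append((component, current + step))
            else:
                for node in component:
                    labels[node] = next_label
                next_label += 1
    return labels
```

Raising the threshold only for oversized clusters keeps well-separated small clusters intact while splitting groups that were merged through weak links.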

5. Experimental Verification

5.1. Dataset Description
5.1.1. G3D Dataset

This dataset is a skeleton-based dataset [21] collected from game data. It contains 10 participants, who perform 20 categories of game behaviors. Most behavior sequences are recorded by a specific camera in a controlled indoor environment. Participants perform basic behaviors in strict accordance with instructions, and each sequence is repeated 3 times by each subject. Nevertheless, participants are free to complete the different exercise sequences according to their own exercise habits. The dataset contains manually labeled behavior category labels for all sequences.

The skeleton in this dataset consists of 20 joints, and the position of the participant’s joints is expressed in X, Y, and Z coordinates in meters. The skeleton data also includes a joint tracking state, including accurately tracked joints, imported joint coordinates, and predicted joint coordinates. In many cases, the predicted joints are accurate, but in some cases, the limbs are occluded and the predicted joints may be inaccurate. Since some joint points in the dataset are obtained through prediction, the accuracy of the final classification will be affected to a certain extent if the predicted joints are inaccurate.

5.1.2. HDM05 Dataset

The behavior sequences in this dataset are performed by 5 nonprofessional actors [22]. Most of the behavior sequences are performed multiple times by all five actors according to the specific instructions in the script. The script contains five parts, and each part is divided into several scenes. Each behavior sequence is only collected in the corresponding single scene. The skeleton in this dataset consists of 31 joints, and the 3D coordinates of the joints are represented in X, Y, and Z coordinates in centimeters.

Although the dataset is small in scale, the behavior categories are more detailed, with a total of 130 behavior categories, some of which may look similar. Therefore, this dataset is also somewhat challenging.

5.1.3. NTU-RGBD Dataset

The NTU-RGBD dataset contains 60 behavior classes and 56,880 video samples [23]. For each sample, the dataset provides RGB video, a depth map sequence, 3D skeleton data, and infrared (IR) video. Each sample is captured simultaneously by 3 Kinect V2 cameras. Here, we use the 3D skeleton data, in which the 3D coordinates of the joints are expressed in X, Y, and Z coordinates and each frame contains the 3D coordinates of 25 human body joints. The original benchmark provides two evaluation protocols, namely, cross-subject (CS) and cross-view (CV) evaluation. In CS evaluation, the training set contains 40,320 videos from 20 subjects, and the remaining 16,560 videos are used for testing. In CV evaluation, the 37,920 videos captured by cameras No. 2 and No. 3 are used for training, and the remaining 18,960 videos from camera No. 1 are used for testing.

This dataset is widely used in skeleton-based behavior recognition. It has several scene categories, including daily behaviors, medical scenes, and multiperson sports. Since it contains both single-person sequences and multiperson interaction sequences, it is quite challenging to perform recognition tasks on this dataset.

Table 1 summarizes the main data distribution characteristics of the above three datasets. It can be seen that the number of joints and the number of bones selected in the three datasets are roughly similar, and the number of frames in each behavior sequence varies widely, ranging from a few frames to a few hundred frames, i.e., it is linearly adjustable within certain limits. From this perspective, it is very important to fully dig out the temporal information to complete the task of behavior recognition. Judging from the number of behavior sequences contained, the scales of the three datasets from small to large are G3D-Gaming, HDM05, and NTU-RGBD; from the perspective of the divided behavior categories, HDM05 has the most behavior categories, indicating the classification of behavior sequences is finer, and the corresponding recognition difficulty is also greater. In addition, in order to further improve the generalization ability of recognition in the future, we have implemented a behavior recognition data acquisition system with multichannel video input. The system can be connected to the mainstream RGBD cameras on the market, and the number of channels is linearly adjustable within a certain range. The collected videos can be processed into the current major formats, for example, AVI, MPEG, and MP4. We can estimate the 3D skeleton sequences as datasets from the collected video data.

On the G3D and HDM05 datasets, we follow the cross-validation protocol, using half of the dataset for training and the remaining half for testing. The experimental settings on the NTU dataset adopt the commonly used cross-subject and cross-view protocols. To keep the number of frames consistent across behavior sequences, we downsample the frames of the skeleton sequences so that each dataset has a fixed number of frames per sequence: 100 for the G3D dataset, 300 for the HDM05 dataset, and 300 for the NTU dataset. For the three datasets, we apply similar normalization preprocessing to achieve invariance to position and view changes.
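Bringing every sequence to a fixed number of frames can be sketched with uniform index sampling. The function below is a minimal illustration and not necessarily the sampling rule used in this study: it picks evenly spaced frame indices, and for sequences shorter than the target it repeats frames, which is an assumption since the text only mentions downsampling. The name `resample_frames` is hypothetical.

```python
def resample_frames(sequence, target_len):
    """Return target_len frames taken at evenly spaced positions.

    The first and last frames are always kept. Frames are repeated when the
    sequence is shorter than target_len (an assumption of this sketch).
    """
    n = len(sequence)
    if n == 0 or target_len <= 0:
        raise ValueError("sequence must be non-empty and target_len positive")
    if target_len == 1:
        return [sequence[0]]
    return [sequence[round(i * (n - 1) / (target_len - 1))]
            for i in range(target_len)]
```

For example, `resample_frames(frames, 100)` would correspond to the G3D setting and `resample_frames(frames, 300)` to the HDM05 and NTU settings.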

5.2. Experiment and Comparative Analysis

We first test the classification performance of the proposed method on the G3D dataset. The 663 sequences in the dataset are divided into a training set and a test set by participant. The behavior sequences performed by participants 1, 3, 5, 7, and 9 are used as the training set, and those performed by the remaining participants are used as the test set; thus, 333 training sequences and 330 test sequences are obtained.

Because the dataset is small, we use a relatively small neighbor set when constructing the graph. In the update process of graph convolution, for each node, its closest node and the 11 closest nodes around it are selected. Initially, they are considered to be of the same class, and then the edge weights are updated.

The experimental results on the G3D dataset are shown in Table 2. The data in the table show that the proposed method performs better than the previous methods. The reason is that the previous methods directly expand the manifold data and feed them into the network for learning. In this process, some manifold constraints are destroyed, so the subsequent network cannot mine the rich information originally contained in the manifold data. The proposed method continuously projects manifold curves into the corresponding projection space along the average curve of the class, which keeps the distances between the projected curves as consistent as possible with those between the original manifold curves. In this way, the subsequent graph convolution can use the similarity between the projected curves for classification.

The proposed method achieves an improvement of 1.59% over the method that combines a deep neural network. This is because the spatiotemporal trajectory can mine more abundant co-occurrence features, and these features allow a better similarity construction. The subsequent graph convolution network can then improve the classification result by pulling similar nodes closer and pushing dissimilar ones farther apart.

In the HDM05 dataset, we randomly select half of the behavior sequences from each class as the training set and the remaining half as the test set. The dataset contains a total of 2343 behavior sequences in 130 detailed behavior categories, so each category has fewer than 20 behavior sequences on average. After the split, the training set and the test set each have about 10 behavior sequences per category. Therefore, in the update process of graph convolution, for each node, one of its closest nodes and the 7 closest nodes around it are selected.

The experimental results on the HDM05 dataset are shown in Table 3. Compared with the method that uses only manifold learning, the proposed method achieves an improvement of about 20%. We attribute this to two factors: the continuous projection method based on the manifold curve can learn features that contain rich spatiotemporal co-occurrence from the manifold data, and the similarity graph between behavior nodes is better constructed, so that graph convolution can be used for further similarity learning. In this process, the method based on continuous projection maintains the similarity between curves, especially between curves of the same category. This is the key step that connects the manifold data and the deep network.

Compared with some methods using deep learning, such as PB-GCN [28], our method also has a certain improvement. The reason may be that the conventional deep learning network just arranges the data according to a certain dimension. For example, the data separated into different body parts are sent to the network for learning. In this process, the local behavior information of most of the skeleton coordinates can be used, but it is difficult to learn the essential complicated features of the relative relationship of the movement in the network. Nonetheless, the proposed network can use this information by learning the features of the manifold trajectory.

On the NTU-RGBD dataset, we conduct training and testing according to the commonly used data division and perform cross-subject and cross-view experiments, respectively. Because the dataset contains a large number of behavior sequences for each category, each node cannot be directly connected to all of its peers when constructing the graph. Instead, the 200 nearest neighbor nodes of each node are selected to form the adjacency list. In the update process of graph convolution, for each node, one of its closest nodes and the 20 closest neighbors around it are selected.
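The nearest-neighbor adjacency list described above can be sketched as follows. This is a simplified illustration under stated assumptions: nodes are represented by plain feature vectors and compared with the Euclidean distance, whereas our graph is built on the similarity between projected manifold curves. The function name `knn_adjacency` is hypothetical, and the brute-force search is quadratic in the number of nodes.

```python
import math


def knn_adjacency(features, k):
    """Return, for every node, the indices of its k nearest neighbors.

    Assumption: `features` is a list of equal-length numeric vectors and
    Euclidean distance stands in for the similarity measure.
    """
    n = len(features)
    adjacency = []
    for i in range(n):
        # Distance from node i to every other node, paired with its index.
        dists = [(math.dist(features[i], features[j]), j)
                 for j in range(n) if j != i]
        dists.sort()
        adjacency.append([j for _, j in dists[:k]])
    return adjacency
```

Under these assumptions, the NTU setting would correspond to `k=200` when forming the adjacency list, with a smaller neighbor set (20) used during the graph convolution update.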

The experimental results on the NTU dataset are shown in Table 4. The proposed method is greatly improved compared with the method that uses only the Lie group. The reason is that, after the graph construction by continuous projection, the introduced graph convolution module can leverage backpropagation to enhance the learning ability of the Lie group network. Compared with some existing deep learning methods such as Deep-LSTM [23], ST-LSTM [29], TCN [7], and GCA-LSTM [30], our method also has some advantages. When mining behavior sequences, these methods focus mainly on either the temporal features or the spatial features, whereas our method organically combines the temporal and spatial features of the behavior by means of the manifold behavior trajectory. Compared with the current mainstream behavior recognition methods HCN [31], ST-GR [32], and ST-GCN [8], our method remains comparable.

5.3. Ablation Study

To verify the effectiveness of the proposed method, we performed ablation experiments on the HDM05 dataset to validate each module. We conducted four experiments to compare the method of directly stretching the manifold data into Euclidean data (Stretch), the method of logarithmic mapping (LogMap), the method of continuous projection (Ours/G), and the continuous projection combined with graph convolution.

The results of the ablation experiments on the HDM05 dataset are shown in Table 5. The table shows that directly stretching the manifold data into Euclidean data gives the worst result. In this process, the constraints of the manifold data are broken, so a large amount of the spatiotemporal information they contain is difficult for subsequent networks to utilize. The logarithmic mapping method can retain part of the data constraints by projecting the data into the tangent space. After projection, the data can still express most of the spatiotemporal feature information. Compared with the logarithmic mapping method, the method based on continuous projection brings a further large improvement, which shows that continuous projection preserves the similarity of the data after projection better than the logarithmic mapping does. Finally, the method of continuous projection combined with graph convolution achieves the best results, which shows that the graph convolution used here pulls similar nodes closer and pushes dissimilar ones farther apart, thereby improving the classification result.
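The difference between the Stretch and LogMap baselines can be illustrated on a single rotation. The sketch below assumes, purely for illustration, that a manifold element is a 3x3 rotation matrix in SO(3) and that the tangent space is taken at the identity; the function names `stretch` and `so3_log` are hypothetical. Stretching simply flattens the nine matrix entries and ignores the orthogonality constraint, while the logarithmic map returns the three-dimensional rotation vector in the tangent space.

```python
import math


def stretch(R):
    """Flatten a 3x3 matrix into 9 numbers, ignoring manifold constraints."""
    return [entry for row in R for entry in row]


def so3_log(R):
    """Logarithmic map of a rotation matrix to its rotation vector.

    Uses theta = arccos((trace(R) - 1) / 2) and the vector part of
    R - R^T. Rotations with theta close to pi are numerically unstable
    with this formula and are not handled in this sketch.
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    v = (R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1])
    if theta < 1e-8:
        scale = 0.5  # limit of theta / (2 sin(theta)) as theta -> 0
    else:
        scale = theta / (2.0 * math.sin(theta))
    return tuple(scale * c for c in v)
```

For a rotation of 90 degrees about the z-axis, `so3_log` returns a vector along z whose length is the rotation angle, whereas `stretch` returns nine coupled entries whose constraints a downstream Euclidean network has no way of knowing.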

6. Conclusion

In this study, a deep heterogeneous manifold network is proposed. It incorporates a graph construction method based on Riemannian metric, which can preserve the nonlinear constraints of the spatiotemporal trajectory to a large extent and obtain better data projection through continuous projection. The graph nodes of behavior sequences built by this method are input to graph convolutions to realize the clustering and classification, which can improve the classification result of behavior recognition. The whole architecture combines a manifold learning backbone subnetwork and a graph convolutional network. The two parts learn from each other through end-to-end optimization, and manifold-based graph construction can guide the manifold network. The proposed method has been validated on several mainstream skeleton-based datasets and achieved competitive results. In the future, we will investigate how to automatically learn features represented in Riemannian manifold from raw data, which will further improve the discriminativeness of Riemannian representations.

Data Availability

All datasets are public datasets that can be downloaded online. G3D dataset is publicly available at https://dipersec.king.ac.uk/G3D/G3D.html, NTU RGB + D dataset is publicly available at https://rose1.ntu.edu.sg/dataset/actionRecognition/, and HDM05 dataset is publicly available at https://resources.mpi-inf.mpg.de/HDM05/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Jinghong Chen and Li Zhang contributed equally to this work.

Acknowledgments

This work was supported by the Shenzhen Science and Technology Programs under Grant nos. JCYJ20180306173210774 and JCYJ20200109143035495.