Abstract

Gait recognition is a powerful tool for long-distance identification. However, gaits are influenced by walking environments and appearance changes, so the gait recognition rate declines sharply when the viewing angle changes. In this work, we propose a novel cross-view gait recognition method based on two-way similarity learning. Focusing on the relationships between gait elements in three-dimensional space and the wholeness of human body movements, we design a three-dimensional gait constraint model that is robust to view changes and built on joint motion constraint relationships. Unlike classic three-dimensional models, the proposed model characterizes the motion constraints and action constraints between joints along the time and space dimensions. We then propose an end-to-end two-way gait network that uses long short-term memory and residual network 50 to extract the temporal and spatial difference features, respectively, of model pairs. The two types of difference features are merged at a high level in the network, and similarity values are obtained through a softmax layer. Our method is evaluated on the challenging CASIA-B data set for cross-view gait recognition. The experimental results show that the method achieves a higher recognition rate than previously developed model-based methods, reaching 72.8% when the viewing angle ranges from 36° to 144° under normal walking conditions. Finally, the new method also performs better in cases with large cross-view angles, illustrating that our model is robust to viewing angle changes and that the proposed network offers considerable potential in practical application scenarios.

1. Introduction

Gait recognition is an emerging biometric recognition approach used to identify people by their walking patterns. Unlike biometric identifiers such as faces, fingerprints, and irises, gaits have unique advantages, such as being noncontact, noninvasive, and hard to hide or forge, because they arise from continuous, cyclical whole-body movements. Early methods of gait feature analysis were used in the medical field, such as for the early diagnosis of Parkinson’s disease [1], cerebral palsy [2], and other diseases. In recent years, an increasing number of cameras have been installed for security surveillance and crime forensics. Gait recognition is the most suitable human identification technology for long-distance monitoring. Another competitive advantage is that subjects do not have to actively cooperate with dedicated acquisition equipment. Therefore, gait recognition offers unique advantages in video surveillance. For example, the United Kingdom and Denmark have used gait information as criminal evidence. Despite these advantages, one of the greatest challenges of gait recognition is removing factors that are unrelated to human identity. These unavoidable influencing factors include walking mannerisms, such as walking speeds, clothing, and carried objects, and walking environmental factors, such as camera viewing angles, road surfaces, human body occlusion, and illumination conditions. Such factors cause large intraclass fluctuations in gait features. In the past 10 years, many works have researched robust gait feature extraction and have achieved promising results. In practical application scenarios, viewing angle changes are the most difficult factor to handle because of the nonrigid movement of the human body, so cross-view gait recognition remains an active and challenging research topic.

The gait recognition problem has been commonly studied from the perspective of computer vision, because vision-based acquisition is nonintrusive for the subject. The first gait recognition task is to build a model for describing gait characteristics. Appearance models are currently a popular modeling approach that integrates dynamic and static gait pattern information from video or image sequences into a two-dimensional (2D) feature image. Gait energy images (GEIs), motion silhouette images (MSIs), and motion history images (MHIs) are commonly used in appearance models. These appearance-based methods extract view-invariant gait features directly from planar images and achieve good results when the encountered viewing angle changes are relatively small. When the viewing angle between two subjects increases, the performance of these models decreases drastically because features obtained from a 2D image can richly express gait information but only weakly capture spatial change relationships. Gait is a characteristic of the human body as it moves periodically in three-dimensional (3D) space. Regardless of how the viewing angle changes, the structure of the human body itself remains view invariant. Studies in [3] showed that human joints can express gait changes. The mutual restraint relationships between joints can express the gait element characteristics, such as limb motion ranges, step lengths, foot angles, and center of gravity positions. The vector relationship formed between joints is also rotation invariant. Therefore, gait feature modeling through 3D joint poses is more in line with the spatial motion characteristics of the human body than appearance-based modeling.

Most methods based on deep learning extract gait-invariant features and then measure similarity for the final judgment. When there are many classes and relatively few samples per class, it is difficult for classification algorithms to achieve good results. At the same time, in practical scenarios, gait recognition often needs to directly judge the similarity of input pairs rather than solve a classification problem. From the perspective of gait verification, some scholars have designed methods that can directly discriminate gait similarity. To directly determine the similarity of a pair of GEIs, Zhang [4] developed an end-to-end gait measurement method based on a Siamese network. In [5], a convolutional neural network (CNN) based gait similarity measurement network was proposed by using a feature fusion technique. That work concluded that the network can extract gait difference features when two GEIs to be compared are input directly.

Considering that the human body moves in three-dimensional space, the kinematic relationships between joints exhibit view invariance. In this work, we propose a novel two-way similarity learning method for gait recognition based on 3D human body poses. Using the constraint relationships between joints in 3D space, a view-robust 3D motion constraint matrix is constructed from the two dimensions of space and time. Then, an end-to-end similarity learning network is proposed to automatically learn the difference features of input pairs. The network that learns the spatiotemporal difference features of 3D gaits is a hybrid of long short-term memory (LSTM) and residual network 50 (ResNet-50). The contributions of our work can be summarized as follows:
(1) Focusing on the overall motion constraint relationships between joints rather than the local physical motion characteristics of a single joint, we construct a 3D gait feature model that represents the spatiotemporal motion constraints between joints.
(2) We develop an end-to-end two-way spatiotemporal similarity learning network using LSTM and ResNet-50 that performs nonlinear feature modeling of joint motion constraint difference features and spatial action difference features and directly predicts the similarity of input pairs.
(3) Compared with several state-of-the-art methods, the proposed 3D gait constraint model achieves better viewing angle robustness and more stable cross-view recognition by learning temporal and spatial difference features.

The rest of the paper is organized as follows: In Section 2, the related works on gait recognition are reviewed. In Section 3, the 3D constraint modeling method and the proposed 3D gait similarity learning network framework are introduced. Finally, the proposed method is evaluated in Section 4, and the conclusion of this paper is presented in Section 5.

2. Related Works

The first task required for gait recognition is to build a gait model that can express gait characteristics from sequence images. Then, given a probe model and gallery models, abstract high-level gait features are obtained via machine learning. As a simple implementation, the feature similarity between each probe and gallery pair is calculated, and the gallery sample closest to the probe is selected as the probe identity. According to the gait model type utilized, gait recognition methods can be divided into two typical categories: appearance-based and model-based methods. The former extract gait representations directly from videos, and the latter model the underlying structure of the human body. Our work in this paper falls within the latter category.

2.1. Appearance-Based Methods

Appearance-based methods first obtain dynamic and static information from periodic, continuous image sequences according to specific rules. There are two types of appearance-based methods: template-based and sequence-based methods. Template-based methods construct a two-dimensional gait feature image (also called an energy-like image) that is generated by superimposing this information. Different energy-like images express the temporal and spatial gait characteristics in specific ways and offer specific strengths. Commonly used images are GEIs, MSIs, and MHIs, with GEIs being the most commonly used for appearance-based models. Sequence-based methods use pedestrian silhouette sequences as input to extract features. After this, a discriminative gait feature representation is obtained by machine learning. In recent decades, appearance-based methods have produced many satisfactory results. As mentioned before, one of the difficulties for appearance models is that appearances change when the viewing angle changes. The following three approaches are commonly used to obtain view-invariant gait features.

Traditional view-invariant feature extraction methods include the view transformation model (VTM), canonical correlation analysis (CCA), and subspace learning.

VTM projects gait features from one view into another. Makihara et al. [6] first proposed transforming the view of a gait template. A VTM is constructed from view-independent and object-independent matrices based on singular value decomposition (SVD). Kusakunniran et al. [7] introduced truncated SVD to avoid oversizing and overfitting the model. On this basis, the following methods further optimize the VTM method: introducing support vector regression (SVR) to model nonlinear correlation [8] and constructing a shared high-level virtual path to project a single model for different viewpoints on a single canonical view [9]. CCA projects each gait feature pair into two subspaces with maximal correlations. Bashir et al. [10] mined the low-dimensional geometric structure of feature data by CCA and learned the projection matrix of a specific view. Xing et al. [11] aimed to solve the inconsistency of CCA projections when processing two high-dimensional data sets and proposed a complete CCA method to improve the performance of the original CCA technique. Because some weakly correlated or noncorrelated information may exist in global gait features, Kusakunniran et al. [12] applied CCA to groups of features obtained from the motion coclustering of global gait features. Subspace learning projects a different view of a common subspace to measure similarity. Xu et al. [13] presented multiview max-margin subspace learning (MMMSL) to learn a common subspace for associating gait data across different views. Connie et al. [14] used dual-core principal component analysis (PCA) to perform coefficient expansion to establish a nonlinear subspace, forming a Grassman manifold to describe multiview gait features. The structure of a nonlinear subspace can more appropriately retain the gait characteristics when the viewing angle changes. The above methods alleviate the impact of view changes on gait recognition accuracy to a certain extent. However, they are still not effective in solving the highly nonlinear correlations among gait features under different views.

After 2015, deep learning methods began to appear in the gait recognition field [15–19]. CNN-based methods [9, 20–22] combine feature extraction and gait classification into one task. Furthermore, the nonlinear mapping and hierarchical feature extraction capabilities of deep networks dramatically reduce the impact of view changes on gaits. Yan et al. [23] extracted high-level gait features and introduced a multitask learning model for state prediction during gait recognition; this method achieves better performance than gait recognition alone. Yu et al. [17] proposed an autoencoder-based model to construct a VTM for cross-view gait recognition. They fed walking states to the first two-layer encoder of the model, and each subsequent layer converted the view by 18° to reconstruct a GEI that was not affected by clothing or carried objects. In actual application scenarios, it is necessary to directly judge the similarity of two subjects while their identities are unknown. Some methods therefore develop end-to-end networks that determine the similarity of input pairs directly. Among these, the Siamese network is the most popular; it is a valuable tool for evaluating the similarity of sample pairs from different people. Zhang et al. [4] constructed an end-to-end Siamese network for gait similarity measurement in which two convolutional subnetworks with the same structure extract the GEI features and a decision-making layer directly judges the matching degree of GEI pairs. Inspired by the Siamese network, Wu et al. [5] constructed gait difference feature learning networks for input pairs at the data level, feature level, and decision level and analyzed their performance, concluding that the data-level fusion network learns better gait difference features.

Considering that gait is a continuous movement characteristic, some scholars have also used silhouette sequences as inputs to achieve view invariance. Typical methods of this kind are based on LSTM [24] or 3D-CNNs [25, 26]. These methods fully use the temporal and spatial gait information and achieve better recognition results. Chao et al. [27] claimed that gait information has a cross-spatiotemporal correlation. They fed a disordered set of gait silhouette images to the GaitSet network to automatically learn the spatial structures and positional relationships of gaits. This method significantly improved gait recognition rates over other approaches and has been regarded as a performance benchmark in the field of cross-view recognition. Fan et al. [28] focused on the motion characteristics of different body parts, using the part-based model GaitPart to extract the spatiotemporal micromotion features of different parts from the silhouette sequence, and achieved a high recognition rate. Hou et al. [29] used the GLN (gait lateral network) to learn both discriminative and compact representations from silhouettes. A feature pyramid combining the silhouette-level and set-level features extracted at different stages was then merged through lateral connections in a top-down manner. Sequence-based methods introduce richer spatial information and add temporal information, which improves the recognition rate compared with template-based methods.

2.2. Model-Based Methods

As mentioned before, model-based approaches model the structure or movement of the human body and are therefore insensitive to human shape or appearance. Kastaniotis et al. [30] used human body skeleton and joint data obtained from a low-cost Kinect sensor to recognize different people. The results showed that body joints contain sufficient information for human identification. However, in video surveillance scenes, precisely recovering body structures from videos captured under uncontrolled conditions remains a research difficulty [31–33].

Because gaits satisfy view invariance in 3D space, some methods attempt to use 3D imaging equipment or reconstruct 3D gait models of the human body in a collaborative multicamera environment. These approaches suffer from complicated camera parameter adjustment and modeling calculations, so accurate body part tracking becomes the bottleneck of model-based methods. Since 2016, the advent of human body pose estimation methods has made it easy to directly estimate human skeleton information from images or videos, which has also advanced research on model-based gait recognition. Liao et al. [34] extracted 2D joint poses from video sequences with OpenPose2D and constructed a pose-based temporal-spatial network (PTSN) to obtain the spatiotemporal change characteristics of human joint sequences. In [35], pose-based LSTM (PLSTM) was used to reconstruct the view of a 12-joint heatmap sequence to reduce the impact of view changes; however, this method cannot fully reconstruct joint heatmaps of gait sequences across more than three views simultaneously. Moreover, while 2D poses are not view invariant, 3D poses are. With the maturity of OpenPose3D technology, estimating 3D skeleton data from 2D skeleton data in real time has become more accessible. Recently, Liao et al. [36] further used OpenPose3D to establish a motion constraint model for joint poses and extracted 3D joint motion features using a CNN. This method achieved a recognition rate of more than 80% for cross views under the same walking state. As discussed above, model-based methods, especially 3D modeling methods, are reasonable gait recognition solutions.

Despite its good performance, appearance-based modeling is affected to some extent by moving backgrounds and occlusion in real scenes, and the 2D pose model is essentially a planar expression of gait features that is also not view invariant. In contrast, human body pose estimation is more tolerant of interference from the external environment, which shows its great potential in actual gait recognition applications. Therefore, there remains much room to improve model-based methods.

3. Proposed Method

Gait is a periodic movement of the joints driven by the limbs, and it is view invariant in three-dimensional space. Based on this, a view-robust 3D motion constraint matrix is constructed to express the changing motion constraint relationships between joints. Starting from the requirements of gait verification tasks in video surveillance scenes, an end-to-end two-way similarity learning method is constructed to model the spatial and temporal differences of 3D motion constraints and directly judge the similarity of input pairs. The framework of the new method is shown in Figure 1.

The overall method consists of three components: a 3D joint coordinate matrix extraction (JCME) module, a 3D gait constraint matrix generation (GCMG) module, and a two-way similarity learning network (TSLN). The JCME module first automatically extracts 3D joint coordinates from continuous video frames using OpenPose3D and generates a joint coordinate matrix. After the JCME module, the GCMG module processes the obtained joint coordinate matrix to generate a 3D gait constraint matrix based on the time and space dimensions. This matrix expresses the spatiotemporal constraint relationships between joints. Finally, the 3D motion constraint matrix pairs are fed into the TSLN. The two branches of the TSLN extract the difference features of the motion constraint and action constraint. The two types of features are merged in a bottom-to-top manner to produce the final similarity output. The details of each process are described separately next.

3.1. 3D Joint Coordinate Matrix Extraction Module

To establish a 3D gait model independent of the viewing angle, the first task of the new method is to obtain the 3D coordinates of the human joints. The role of the JCME module is to generate a 3D coordinate matrix. Pedestrian videos are fed directly to the JCME module, where the pretrained OpenPose3D model is used to estimate the 3D coordinates of 14 joints (head, neck, LShoulder, LElbow, LWrist, RShoulder, RElbow, RWrist, LHip, LKnee, LAnkle, RHip, RKnee, and RAnkle) from the video frames; the joint positions and names follow the COCO output format of OpenPose. The 3D coordinates of the i-th joint in the m-th frame are denoted by (x_i^m, y_i^m, z_i^m). A 3D coordinate matrix is then constructed to express the mutual position relationships of the joints in 3D space; its row vectors are composed of the 14 joint coordinate values in the X, Y, and Z directions for each frame.
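The following is a minimal sketch (not the authors' implementation) of how such a coordinate matrix could be assembled, assuming the per-frame joint coordinates returned by a 3D pose estimator such as OpenPose3D are already available as an array of shape (num_frames, 14, 3); the function name and the in-memory layout are illustrative assumptions.

```python
import numpy as np

def build_coordinate_matrix(joints_3d: np.ndarray) -> np.ndarray:
    """Flatten per-frame 3D joint coordinates into a coordinate matrix.

    joints_3d: array of shape (num_frames, 14, 3) holding the (x, y, z)
    coordinates of the 14 COCO-style joints estimated for each frame
    (an assumed layout; the paper does not fix an in-memory format).

    Returns an array of shape (num_frames, 42): each row stacks the X, Y,
    and Z values of all 14 joints for one frame, matching the description
    of the JCME output.
    """
    num_frames = joints_3d.shape[0]
    return joints_3d.reshape(num_frames, -1)

# Example with random coordinates standing in for OpenPose3D output.
if __name__ == "__main__":
    fake_joints = np.random.rand(60, 14, 3)   # 60 frames, 14 joints, (x, y, z)
    coord_matrix = build_coordinate_matrix(fake_joints)
    print(coord_matrix.shape)                 # (60, 42)
```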

3.2. 3D Gait Constraint Matrix Generation Module

Earlier work showed that joint motion is sufficient to identify different subjects [37]. However, that research focused more on the local physical dependencies between joints while ignoring the implicit correlations among joint motions. From the perspective of medical research, gait elements such as the joint angle, step length, step width, position of the body's center of gravity, and step frequency are the characterizing gait factors. They form an organic whole that constrains the movement of the joints driven by the limbs. The motion constraint relationships between the joints in three-dimensional space can effectively express the spatiotemporal motion characteristics of the gait. Therefore, our goal is to establish a 3D gait constraint model that expresses the motion relationships of gait elements in 3D space.

From the previous module, we obtain the 3D coordinate matrix of the 14 joints. The main function of the GCMG module is to build a 3D gait constraint model; its workflow is shown in Figure 2. A temporal constraint extractor and a temporal motion constraint computation generate the temporal motion constraint vectors. In the same way, the spatial action constraint vectors are generated by a spatial constraint extractor and a spatial action constraint computation. The 3D gait model builder then combines the two types of vectors to construct the 3D gait constraint matrix, which consists of a motion constraint matrix and an action constraint matrix. The implementation details are described in the following sections.

3.2.1. Motion Constraint Matrix Construction

The temporal motion constraint relationships describe how the constraints between joints change over a time series. These relationships are of two types: angular constraint relationships and proportional constraint relationships. An angular constraint relationship is composed of the adjacent joint angle, the foot angle, and the body gravity deviation angle. The ratio of the body height to the step length and the ratio of the body height to the vertical heights of the joints compose the proportional constraint relationships.

Figure 3 shows the constraint relationships between joints. Figure 3(a) shows the 14 joints estimated by OpenPose3D. In Figure 3(b), α describes the angular constraint between adjacent joints. In Figure 3(c), the foot angle describes the angular constraint of the foot direction, and the gravity deviation angle describes the angular constraint of the deviation of the body's center of gravity. These angular constraint relationships express the interactions between joints. In addition, movement information of the human body structure is critical for gait characterization. Feng et al. [33] used changes in a joint heatmap to express gait characteristics while ignoring movement changes in the human body structure. The dynamic proportional constraints of the human body structure express the overall movement characteristics of the human body. Two types of proportional constraint relationships are constructed here. The first is the constraint of the ratio between body height and step length; Figure 3(d) shows the vertical body height and the step length (the proportion of z0 to d10–13). The second is the constraint of the ratio between the vertical height of the body and that of the joints. The calculation processes of the above five constraint relationships are defined in formulas (1)–(5), in which the normal vector of the XOY plane is used.
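Since formulas (1)–(5) are not reproduced above, the numpy sketch below only illustrates the kinds of quantities involved (an adjacent-joint angle, a gravity deviation angle relative to the normal of the XOY plane, and a height-to-step-length ratio); the exact definitions in the paper may differ, and all function names here are hypothetical.

```python
import numpy as np

def joint_angle(parent: np.ndarray, joint: np.ndarray, child: np.ndarray) -> float:
    """Angle (radians) at `joint` formed by the segments joint->parent and joint->child."""
    u, v = parent - joint, child - joint
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def gravity_deviation_angle(neck: np.ndarray, hip_center: np.ndarray) -> float:
    """Angle between the trunk vector and the normal of the XOY plane (the Z axis)."""
    trunk = neck - hip_center
    z_axis = np.array([0.0, 0.0, 1.0])            # normal vector of the XOY plane
    cos = np.dot(trunk, z_axis) / (np.linalg.norm(trunk) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def height_step_ratio(head: np.ndarray, ankle_l: np.ndarray, ankle_r: np.ndarray) -> float:
    """Ratio of the vertical body height to the current step length."""
    body_height = head[2] - min(ankle_l[2], ankle_r[2])
    step_length = np.linalg.norm(ankle_l - ankle_r)
    return float(body_height / (step_length + 1e-8))
```

Per-frame values of these five constraint types would then be stacked frame by frame to form the motion constraint matrix described next.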

3.2.2. Action Constraint Matrix Construction

Because the joint positions differ from frame to frame, spatial action constraint relationships can be used to express the changing joint position relations of gait elements such as the step length, step width, and joint displacement in 3D space. As shown in Figure 4, the displacement vector of the k-th joint in the m-th frame describes the action difference of the joints in the space dimension; it is defined from the change in the joint's 3D coordinates between two consecutive frames and the time interval between those frames.
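A minimal sketch of this computation is shown below, under two stated assumptions: the displacement vector is read as the coordinate change divided by the frame interval, and a body center-of-gravity point (here simply the mean of the 14 joints) is appended as a 15th point so that each row has 15 × 3 = 45 values, matching the 45-dimensional input of the ResNet-50 branch described in Section 3.3.2.

```python
import numpy as np

def action_constraint_matrix(joints_3d: np.ndarray, fps: float = 25.0) -> np.ndarray:
    """Joint displacement vectors stacked into an action constraint matrix.

    joints_3d: array of shape (num_frames, 14, 3) with 3D joint coordinates.
    A center-of-gravity point is appended as a 15th point (mean of the 14
    joints, an assumption), so each row has 15 * 3 = 45 values.
    Each displacement vector is the coordinate change between consecutive
    frames divided by the frame interval 1/fps (an assumed reading of the
    displacement definition).
    """
    cog = joints_3d.mean(axis=1, keepdims=True)             # (num_frames, 1, 3)
    points = np.concatenate([joints_3d, cog], axis=1)       # (num_frames, 15, 3)
    dt = 1.0 / fps
    displacement = (points[1:] - points[:-1]) / dt          # (num_frames - 1, 15, 3)
    return displacement.reshape(displacement.shape[0], -1)  # (num_frames - 1, 45)
```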

The 3D gait model builder of the GCMG module combines the five types of motion constraint vectors to generate a 3D motion constraint matrix, which describes the motion constraint features of the joints over the time series. The sequences of joint displacement vectors compose a 3D action constraint matrix, which describes the action constraint features of the joints in space. These two matrices are called the 3D gait constraint matrices, and their structures are shown in Figure 2. The established 3D gait constraint matrices not only express the relationships among the joint constraint changes in the time and space dimensions but are also robust to view changes.

3.3. Two-Way Similarity Learning Network

As mentioned earlier, the gait verification task is efficient and valuable because, in real scenarios, it is often necessary to judge whether two people are the same person when their identities are unknown. The 3D gait constraint matrices generated by the GCMG module express view-robust gait features in both the temporal and spatial dimensions. Instead of obtaining the gait difference features of input pairs from GEIs, we build a two-way similarity learning network, named TSLN, to model the nonlinear spatiotemporal difference features of 3D gait constraint matrix input pairs. In the TSLN, the temporal motion constraint difference extractor learns the motion constraint difference features of the paired motion constraint matrices, and the spatial action constraint difference extractor learns the action constraint difference features of the paired action constraint matrices. The spatiotemporal difference feature block fuses the two difference features obtained above and produces the similarity prediction for the input pair through two fully connected layers (FC-1 and FC-2) and a softmax layer. The structure of the TSLN module is shown in Figure 5.

3.3.1. Motion Constraint Difference Feature

This feature is obtained by the temporal motion constraint difference feature extractor, called the LSTM branch. The motion constraint matrix expresses the motion constraint relationships between joints over a time series, so an LSTM network is used for nonlinear modeling of the temporal difference features of the joint motion of a pair of motion constraint matrices because of its excellent performance in learning the temporal dependencies of long data sequences. LSTM has memory units and a gate mechanism: its special multiplicative units, called gates, control the flow of information, and its memory cells with self-connections store the temporal cell states of the network. Paper [6] noted that, in the context of neural networks, one layer taking in two inputs can simulate the subtraction operation to compute the difference between a pair of features. Therefore, the input of the LSTM branch is the linear difference between the two motion constraint matrices of the input pair.

There are three sigmoid gates that protect and control the cell state at each sequence index position of each LSTM layer: the forget gate, the input gate, and the output gate. Here, the input at each sequence index is the combination of the corresponding row vectors of the two motion constraint matrices at time t. The outputs of the forget gate, input gate, and output gate are described in formulas (4)–(6), where σ is the sigmoid activation function and W_f, W_i, and W_o and b_f, b_i, and b_o are the weights and biases of the forget gate, input gate, and output gate, respectively.
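Since formulas (4)–(6) are not reproduced in the text, for reference the standard LSTM gate equations take the following form with the notation above; the paper's formulas presumably follow this standard formulation:

\[
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right),\\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right),\\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right),
\end{aligned}
\]

where x_t is the input at sequence index t and h_{t-1} is the hidden state of the previous index.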

This branch consists of two LSTM layers and a flattening layer. The input at each time step is an 88-dimensional vector, and the output of the flattening layer is a 4,400-dimensional vector that represents the motion constraint difference feature of the input pair.

3.3.2. Action Constraint Difference Feature

This feature is obtained by the spatial action constraint difference feature extractor, called the ResNet-50 branch. The action constraint matrix expresses the spatial action constraint features of the joints, so a network that can express local features is required, and ResNet-50 is suitable for this task. First, operations such as convolution, nonlinear activation, and pooling are suitable for extracting local input-dependent features and generating high-level features. Second, the similarity between different subjects may be higher than that between samples of the same subject because of similar body sizes and the influence of external factors; the skip connections of ResNet-50 prevent low-level features from being lost during high-level feature extraction so that the difference features of the subjects can be fully expressed. Third, the dimensionality of the action constraint matrix is relatively low compared with that of appearance models, and vanishing gradient problems may occur during convolution; ResNet-50 avoids the vanishing gradient problem. Thus, ResNet-50 is used to model nonlinear action difference features from the linear action difference of the input pair, namely, the difference between the two action constraint matrices.

ResNet-50 is formed by repeatedly stacking residual blocks. It sends the output of an earlier layer to the following layers through a skip connection and then adds it to the outputs of those layers. The residual block structure is shown in Figure 6: x is the input of the residual block, that is, the difference feature from the previous block, and F(x) is the learned residual mapping. The activation function is a rectified linear unit (ReLU), and the bias is omitted here. The input x skips two layers and is added to the mapped feature F(x) to form the output of the residual block. The output of this branch is the nonlinear action constraint difference feature.
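Assuming the standard residual formulation, the output of such a block can be written as follows (a generic form, not copied from the paper):

\[
y = \mathrm{ReLU}\bigl(F(x) + x\bigr),
\]

where F(x) denotes the residual mapping learned by the stacked convolutional layers of the block.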

The two channels of ResNet-50 are each fed a 45-dimensional vector to simulate the subtraction operation: the two inputs are reweighted through the convolutional layers, and the reweighted results are added to obtain the simulated subtraction result. Finally, the flattening layer of this branch produces a 4,096-dimensional action constraint difference feature.

3.3.3. Spatiotemporal Difference Features

Following the two branches, the spatiotemporal difference feature block first merges the motion constraint difference feature and the action constraint difference feature. Through the two fully connected layers, the merged feature is reduced to a 128-dimensional 3D gait difference feature vector. The last layer uses the softmax function to directly obtain the 3D gait similarity value. The detailed parameters of the TSLN module are shown in Table 1.
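A minimal PyTorch sketch of a two-branch network with this overall shape is given below. It follows the dimensions stated in this section (88-dimensional time-step inputs flattened to 4,400 features in the LSTM branch, 45-dimensional action rows mapped to a 4,096-dimensional feature in the ResNet-50 branch, fusion through FC-1 and FC-2 down to 128 dimensions, and a two-way softmax). The sequence length, hidden size, FC-1 width, the 2-channel arrangement of the paired action constraint matrices, and the ReLU activations are all assumptions, since Table 1 is not reproduced here; this is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TSLN(nn.Module):
    """Two-way similarity learning network sketch (dimensions partly assumed)."""

    def __init__(self, seq_len: int = 50, step_dim: int = 88):
        super().__init__()
        # LSTM branch: two LSTM layers over 88-dim steps, flattened to 4,400 dims
        # (with hidden_size=88 and seq_len=50, 50 * 88 = 4,400 -- an assumption).
        self.lstm = nn.LSTM(input_size=step_dim, hidden_size=step_dim,
                            num_layers=2, batch_first=True)
        self.lstm_out_dim = seq_len * step_dim             # 4,400

        # ResNet-50 branch: first conv adapted to a 2-channel "image" whose two
        # channels are the paired action constraint matrices (assumed layout),
        # followed by a projection to the 4,096-dim action difference feature.
        backbone = resnet50(weights=None)
        backbone.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, 4096)
        self.resnet = backbone

        # Fusion head: FC-1 -> FC-2 (128-dim) -> two-way softmax similarity.
        self.fc1 = nn.Linear(self.lstm_out_dim + 4096, 1024)   # FC-1 width assumed
        self.fc2 = nn.Linear(1024, 128)
        self.classifier = nn.Linear(128, 2)

    def forward(self, motion_diff: torch.Tensor, action_pair: torch.Tensor) -> torch.Tensor:
        # motion_diff: (batch, seq_len, 88), linear difference of motion constraint matrices.
        # action_pair: (batch, 2, H, 45), the two action constraint matrices as channels.
        t, _ = self.lstm(motion_diff)
        f_t = t.flatten(start_dim=1)                        # (batch, 4,400)
        f_s = self.resnet(action_pair)                      # (batch, 4,096)
        fused = torch.cat([f_t, f_s], dim=1)
        h = torch.relu(self.fc2(torch.relu(self.fc1(fused))))
        return torch.softmax(self.classifier(h), dim=1)     # similarity over {different, same}

if __name__ == "__main__":
    net = TSLN()
    motion = torch.randn(4, 50, 88)
    action = torch.randn(4, 2, 64, 45)   # e.g., 64 frame transitions x 45 action values
    print(net(motion, action).shape)     # torch.Size([4, 2])
```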

4. Experiments and Results

Three experiments were conducted to evaluate the proposed method. First, performance evaluation experiments were conducted under cross-view and cross-walking-condition settings to assess the constructed 3D gait model and the similarity learning network. Next, branch evaluation experiments were conducted to assess the extraction of the temporal and spatial difference features. Finally, comparative experiments were conducted to compare our method with other popular state-of-the-art methods.

4.1. Data Sets

Currently, OU-MVLP and CASIA-B are popular public gait data sets widely used by the gait recognition research community. OU-MVLP is the largest gait data set and is tailored for gait recognition methods that rely on silhouettes or GEIs; it does not provide RGB images. Because the proposed method obtains joint coordinates by body pose estimation from video sequences and then constructs a view-robust 3D motion constraint matrix model, it cannot be evaluated on OU-MVLP. CASIA-B is a multiview gait data set provided by the Institute of Automation, Chinese Academy of Sciences, which provides pedestrian video sequences, so we use the CASIA-B gait database to evaluate the proposed method. CASIA-B contains 124 pedestrians, each with 11 view angles (adjacent views are separated by 18°, i.e., 0°, 18°, …, 180°) and three walking conditions: walking with a bag (BG), walking with a coat (CL), and normal walking (NM). The resolution of the videos is 320 × 240, and the frame rate is 25 fps. To acquire the 3D gait constraint matrices of each person, we use OpenPose (developed by Carnegie Mellon University (CMU)) to obtain 14 joint coordinates of the human body and construct 120 × 11 × 10 × 2 gait constraint matrices with the GCMG module. Figure 7 shows some of the motion constraint change relationships of subjects #001, #010, and #068; Figures 7(b)–7(f) show some of the motion constraint relationships over a period of time. The motion constraints of different subjects display some differences under the three walking conditions.

The TSLN module is trained with the gait constraint matrices of subjects #001–074 under all walking conditions; the gallery set uses NM#01–04 of subjects #075–124, and the probe set uses NM#05–06, BG#01–02, and CL#01–02. The number of positive samples was 109 × 110 × 124. To ensure a balanced distribution of positive and negative samples, we comprehensively consider the viewing angles and walking conditions to evenly match the negative samples of subject pairs, ultimately generating approximately 1.5 million negative samples.

4.2. Method Performance Evaluation

We first conduct cross-view and cross-walking condition experiments to evaluate the performance of the new method. The average recognition rate of each viewing angle in the probe set is shown in Table 2. For the three conditions, the average recognition rates at 0° and 180° are the lowest. The reason for this may be that when subjects are facing the camera or facing away from it, the accuracy of 3D pose estimation is relatively low. This situation, in turn, leads to deviations in the 3D coordinates of the joints and further affects the accuracy of the 3D motion constraint matrix. The average recognition rate at 54°–108° is better than that at other viewing angles.

Furthermore, the recognition rates for BG and CL are lower than that for NM, that is, CL < BG < NM. This conclusion is consistent with the results in [6], but the influencing factors behind the two conclusions differ. In [6], walking with a bag or a coat affected the gait silhouettes, which in turn impacted GEI generation. In the new method, walking with a coat affects the accuracy of 3D pose estimation and thus the generation of the 3D gait constraint matrices, while walking with a bag has less influence on the joint coordinate estimation than walking with a coat. As in the NM condition, the average recognition rates for BG and CL over the viewing angle range of 36°–144° are 49.2% and 36.9%, respectively, which are higher than those obtained at 0° and 180°.

Table 3 lists the recognition rate for each viewing angle under normal walking conditions. The recognition rate for matched angles is higher than that for cross views, and the average recognition rate is 94.7%; at 36°–144°, it reaches 95.3%. When the viewing angles of the input pair differ by approximately 90°, the recognition performance drops sharply. Interestingly, however, the recognition performance increases again when the viewing angle difference between the input pair approaches 180°. For example, when the probe sample is at 18° and the gallery sample is at 126°, the recognition rate is 46.9%, but when the gallery angle is 162°, the recognition rate rises to 67.3%. That is, the 3D joint coordinates of two such mirrored viewing angles (summing to 180°) are closer than those of other angle pairs.

Figure 8 shows the average recognition rate of the three walking states at the probe set view angles of 0°, 36°, 54°, 90°, 126°, and 162°.

4.3. Branch Performance Evaluation

Next, the effectiveness of the 3D gait model and the proposed two-way network is verified, with the results shown in Table 4. When the LSTM branch is used alone to learn the difference features of a pair of joint motion constraint matrices, the recognition rate is 45.6%. When the ResNet-50 branch is used alone to learn the difference features of a pair of joint action constraint matrices, the recognition rate is 40.2%. The average recognition rate of the dual-branch network is 49.9%, exceeding either branch alone. These results show that the constructed 3D gait constraint matrices can fully express the spatiotemporal nature of gaits.

4.4. Comparison with State-of-the-Art Methods

Finally, we compare our approach with other popular state-of-the-art gait recognition techniques, including the appearance-based method DeepCNNs [5] and the model-based method PoseGait [36]. Table 5 shows the comparison results of the three methods. The proposed method and the LB network of DeepCNNs both directly mix two comparative samples to simulate the subtraction operation and learn the difference features between the input pairs to obtain similarity values. DeepCNNs uses a CNN-based similarity learning network to learn the difference features of a pair of GEIs, and its average recognition rate is 73.5%, which is higher than those of our method and PoseGait. As stated in [5], DeepCNNs uses the higher-dimensional features of the input GEIs, whereas PoseGait uses 14 human joints as gait features and our method uses 14 human joints plus the center of gravity of the body. Another reason for this result is that the 3D joint coordinates are estimated in two stages (the 2D joint coordinates are first estimated from the video sequence, and the 3D coordinates are then estimated from the 2D coordinates) rather than directly from the image sequence (as for GEIs), which introduces larger errors. Compared with PoseGait, our method uses LSTM and ResNet-50 to learn the spatiotemporal difference features of joint constraints, and the average recognition rate is improved to a certain extent. In addition, the number of training samples is greater than 2 million in our method. The recognition results show that it is feasible to measure the similarity of input pairs separately along the time and space dimensions.

To evaluate the stability of recognition performance, the variances of the recognition rates over all viewing angles were computed. The variances of the three walking conditions for the three methods are listed in the second-to-last row of Table 5. Our method has the smallest variance under BG and CL, at 29.8 and 40.9, respectively. Over the three walking states, the variance of the recognition rate of our method is 38.5, which is lower than the values of the other two methods. Although the average recognition rate of the proposed method is only slightly higher than that of PoseGait, our method is much more stable than PoseGait in cross-view recognition, illustrating that the proposed 3D gait constraint matrix is robust to viewing angle changes.

The sequence-based GaitPart method achieves excellent recognition performance, with a recognition rate of 78.7% under the CL walking condition on CASIA-B. Although the recognition rate of our method is not as good as that of GaitPart, in the case of a large viewing angle span, our method, which uses low-dimensional features, still achieves comparable recognition stability (variance: 39.3 for GaitPart vs. 40.9 for our method).

5. Conclusions

This work proposes a two-way similarity learning method based on 3D poses to realize end-to-end gait recognition. First, considering the view invariance of gait in 3D space and the periodic variation of joint motion constraints, we propose a 3D gait motion constraint matrix to express the temporal and spatial constraint changes of gait in 3D space. Then, an LSTM branch is used to learn the temporal constraint difference features, and a ResNet-50 branch learns the spatial motion difference features of the input pair. The method was evaluated on the CASIA-B database, and the results show that the recognition rate is significantly improved in the case of large view angle changes. The proposed 3D gait model is robust to the view angle and to interfering factors such as clothing variation, but its feature dimension is low. Future work will consider combining appearance-based methods to study gait recognition based on multimodal gait models and further improve the accuracy of cross-view gait recognition.

Data Availability

The processed data used to support the findings of this study have not been made available because the data also form part of an ongoing study.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by The Key Research and Development Program in Social Development Domain of Shaanxi Province (Grant no. 2022SF-424) and the Natural Science Project of Xi'an University of Architecture and Technology (Grant no. ZR19048).