In deep networks that rely on upsampling, detail texture and other high-frequency features are lost during feature extraction, which seriously degrades the accuracy of gesture recognition in complex scenes. To address this problem, this study integrates object detection and gesture recognition into one model and proposes a gesture detection and recognition method for human-computer interaction based on a pyramid frequency feature fusion module and multiscale attention. The pyramid fusion module performs efficient feature fusion to obtain feature layers with rich details and semantic information, which helps to improve the efficiency and accuracy of gesture recognition. In addition, a multiscale attention module is adopted to adaptively mine important and effective feature information from both the spatial and channel perspectives and is embedded into the detection layer. As a result, the proposed network enhances the effective information and suppresses the invalid information of the detection layer. Experimental results show that the proposed model makes full use of the high-low frequency feature fusion module without replacing the basic backbone network, which can greatly reduce the computational overhead while improving the detection accuracy.

1. Introduction

Recently, with the rapid development of science and technology, interaction between humans and machines has appeared in more and more scenarios [1, 2]. Users' requirements for the friendliness, usability, and efficiency of human-computer interaction methods have risen accordingly [3]. Traditional mouse- and keyboard-based interaction methods can hardly meet these needs, prompting many scholars to study intelligent interaction methods such as face recognition, voice recognition, gesture recognition, and motion recognition [4]. Gesture-based interaction has the advantages of being friendly to people with dysphonia and conforming to human communication habits. It is in line with the "human-centered" and "human-computer harmony" direction of human-computer interaction and has become a major research hotspot in this field [4].

Gesture recognition can be adopted to distinguish different types of gestures, which are then converted into robot control commands; it therefore has great prospects in human-computer interaction [5]. However, the nonrigid nature of the human hand, its complex and changeable shape, background changes, and illumination interference during the interaction all increase the difficulty of gesture recognition and make it a very challenging problem. Research on highly representative feature extraction and highly discriminative classification methods for gestures will also provide inspiration for other recognition problems [6].

Since computer vision-based gesture recognition is intuitive, simple, and noncontact, researchers at home and abroad have turned their attention to it [2, 46]. Neverova et al. [7] proposed a recognition method based on active learning and guidance, which uses prior knowledge of skin color to track gestures and boost classifiers to detect and recognize them. Its detection rate, tracking rate, and recognition rate in a dynamic environment are 97%, 99%, and 70%, respectively. Gleeson et al. [8] first adopted background subtraction to obtain the object, used entropy information to distinguish the hand from the complex image, extracted the contour of the hand with the help of chain code, and used the center-of-gravity contour method to improve the recognition effect. Experimental results show that the average recognition accuracy reaches 95% on a set of six gestures. Mazhar et al. [2, 9] proposed a gesture recognition method based on the hand skeleton, which uses a RealSense depth camera to acquire the joint information of the hand skeleton, uses the geometrical shape of the hand to extract valid descriptors, uses a Gaussian mixture model to obtain the Fisher vector that encodes each descriptor, and finally uses a support vector machine (SVM) classifier to classify the gestures. Starner et al. [10] regarded gesture recognition as an image processing problem and used the hidden Markov model to process gesture features, achieving gesture recognition with an accuracy of 94%.

With the rapid growth of information data, there is a huge demand for intelligent data analysis. Artificial neural networks have been widely used to meet the information processing needs of the big data era, and deep learning is an important branch of artificial neural networks. Since its theory was proposed by Hinton et al. [9, 11], deep learning has developed rapidly as a way to extract deep features; it has been applied to speech recognition, image classification, and other pattern recognition problems and has achieved better results. The deep convolutional network is a type of deep learning model that shares weights through convolution kernels and extracts deep features at a small computational cost [12].

Gesture-based human-computer interaction is friendly and natural, and gesture detection and recognition is its core technology. In the field of gesture recognition, researchers have achieved fruitful results and proposed various methods, but a gap remains between these methods and actual engineering applications. Neto et al. [13] used a 3D Hopfield neural network to recognize 15 kinds of gestures and obtained good results on the test data, but the effect in actual engineering applications was mediocre. Wong et al. [14] used convolutional networks to automatically extract gesture and upper-body features and achieved better recognition results. However, existing deep learning methods are easily disturbed by complex backgrounds, and their ability to recognize gestures in complex scenes is weak. The basic backbone for gesture recognition tasks is derived from classification networks in deep learning. With the development of deep neural networks, network structures have become deeper and deeper; for example, ResNet [15] and DenseNet [16] have emerged and shown better feature representation capabilities. To improve the accuracy of deep gesture recognition, the Ges_Net model replaced the VGG-16 backbone of the original SSD algorithm with ResNet-101 and introduced a deconvolution layer to aggregate context information, so as to enhance the high-level semantics of shallow features in gesture recognition [17]. The parameter efficiency of DSOD is higher than that of ResNet and DenseNet, so DSOD can be used as the basic backbone network to train the gesture recognizer from scratch. However, a complex network structure means a large number of network parameters, which reduces the detection speed.
It is worth noting that the deep features in the basic backbone network carry more semantic information, while the shallow features carry more local detail. In other words, shallow and deep feature information are complementary in gesture recognition. For example, the feature pyramid network (FPN) uses a top-down architecture with lateral connections to create feature pyramids and achieve more powerful deep feature representations [18]. In recent years, some improved FPNs have combined different feature layers through a rainbow connection method that uses pooling and deconvolution operations at the same time. The feature fusion Single Shot Multibox Detector generates a large-scale feature by fusing multiple shallow feature layers of different scales and then constructs a new feature pyramid for detection by downsampling this large-scale feature layer. Although these deep methods have effectively improved the accuracy of traditional gesture recognition algorithms, their complex feature fusion methods have greatly reduced the detection speed [19]. Therefore, how effectively shallow and deep features are integrated into the feature pyramid representation determines the detection and recognition performance.

In this study, in order to improve the accuracy of gesture recognition as much as possible with less computational overhead and to avoid the shortcomings of existing gesture recognition methods, a feature fusion module based on the feature pyramid network is proposed. It performs efficient feature fusion to obtain feature layers with rich details and semantic information, which helps to improve the efficiency and accuracy of gesture recognition. In addition, this study adopts a multiscale attention module to adaptively mine important and effective feature information from both the spatial and channel perspectives and embeds it in the detection layer. As a result, the proposed network enhances the effective information and suppresses the invalid information of the detection layer. This study therefore makes full use of the feature information between the detection layers without replacing the basic backbone network, which can greatly reduce the computational overhead while improving the detection accuracy.

2. Feature Pyramid Network

In a convolutional neural network, the input image is composed of a high-frequency part and a low-frequency part: the high-frequency part carries distinct edge details, while the low-frequency part preserves the complete internal characteristics of the image. The low-frequency information is upsampled layer by layer, so as to obtain stable low-frequency feature information [20]. With continued iterative training of the deep network, the loss value decreases and finally tends to be stable. During backpropagation, the high-frequency information that is missed because of the extraction of low-frequency information is constantly compensated [21]. By separating the high- and low-frequency information of the feature, fully extracting the relevant feature information, and effectively fusing it, the feature expression ability can be enhanced, as shown in Figure 1. As can be seen, the feature pyramid network is divided into a three-level structure. The first level is the primary feature pyramid, in which the residual module performs primary feature extraction so that deep features are gradually extracted. In the second level, the network is split by frequency into a low-frequency and a high-frequency feature pyramid, each of which performs frequency extraction. In the third level, adaptive weighted fusion is performed according to the high- and low-frequency information, and the fused feature results are output. Compared with the hourglass network, the feature pyramid network makes up for the lack of texture information in the feature transfer process.
Compensating for the missing texture information not only retains the strong geometric representation ability of high-frequency features while preserving computational efficiency but also integrates effective low-frequency features, which further improves the feature information, maintains the stability of feature transfer, and effectively improves the quality of object recognition [22].

There are two branches inside the primary feature pyramid module: the residual pyramid structure and a shortcut connection that carries the original-scale feature. Multiscale information is obtained through a series of upsampling operations. Since the ResNet model is used as the primary feature extraction module, the original features are retained, and the input and output dimensions of the hierarchical information remain unchanged. The residual pyramid structure can be written as follows:

$$Y = F(X, \{W_i\}) + X, \qquad F(X, \{W_i\}) = W_2\,\sigma(W_1 X), \tag{1}$$

where $X, Y \in \mathbb{R}^{h \times w \times c}$ are the input and output feature tensors, respectively; $h$, $w$, and $c$ represent the spatial dimensions and the number of channels; $F$ is the basic mapping to be fitted by the several stacked layers in the primary feature pyramid; $i$ represents the feature layer; $\sigma$ is the ReLU activation function; and $W_1$ and $W_2$ are the weight matrices. The mapped feature $F(X, \{W_i\})$ and the original feature $X$ are superimposed and fused pixel by pixel, which alleviates vanishing gradients and improves the efficiency of feature transfer.
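As an illustration of the shortcut connection described above, the following minimal sketch applies a two-layer mapping with a ReLU in between and adds the input back. It works on a single feature vector instead of a full feature tensor, and the function names and toy weights are illustrative assumptions, not the paper's implementation.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """Return W2 * relu(W1 * x) + x for one feature vector x."""
    mapped = matvec(W2, relu(matvec(W1, x)))
    # Shortcut connection: add the unchanged input to the mapped feature.
    return [m + xi for m, xi in zip(mapped, x)]


# Toy weights (assumed values, chosen only to make the arithmetic easy to follow).
x = [1.0, -2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
y = residual_block(x, W1, W2)
```

Because the input is added back unchanged, the output keeps the same dimension as the input, which matches the statement that input and output dimensions remain unchanged.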

While the low-frequency information is processed, it is superimposed and fused with the multiscale features returned by the residual connection. With the feature size kept unchanged, the low-frequency information is continuously extracted and fused, so that it is extracted effectively. The low-frequency feature pyramid is therefore expressed in terms of the following quantities: the input feature tensor and the low-frequency pyramid output, together with their spatial dimensions and number of channels; the basic mapping fitted by the stacked layers in the low-frequency pyramid; the feature map generated at each feature layer; the nearest-neighbor upsampling function; the fusion estimate X obtained by the low-frequency pyramid fitting matrix; and the normalization parameter.

In contrast to the low-frequency pyramid feature, the function of the high-frequency pyramid feature is to enhance the texture information of the image. A lack of high-frequency information makes the edge details of the features unclear: objects become blurred and overlapping and the contrast is low, which reduces the detection ability of the model.

In order to ensure the complete and accurate acquisition of high-frequency information, it is only necessary to take, through shortcuts, the high-frequency features that have already been extracted by the primary feature pyramid. Using the deconvolution operation, high-frequency information is continuously extracted by multiple convolutional layers, so as to obtain more feature information, retain feature detail to a large extent, make up for missing feature information, and enhance high-frequency information. The high-frequency pyramid is expressed in terms of the estimated value of the fitting matrix, the deconvolution operation, the normalization parameter, the feature-layer index, the ReLU activation function, and the weight matrices.

In order to fuse different multifocus features effectively, it is necessary to calculate the similarity estimation between the current pixel and all the pixels in the feature map, so as to enhance the spatial global information. The high- and low-frequency fusion tensor is computed from the high- and low-frequency feature tensors over a local region, weighted by fusion coefficients for the high- and low-frequency features. The frequency fusion coefficients are determined according to the value of the frequency feature energy, together with a scaling factor.

Since the feature pyramid network combines the output of the residual network module to make up for the lack of detail information due to the upsampling of the network, it retains the high-frequency image features to a great extent. Through the fusion, it can output more edge detail information, enhance the spatial dimension feature expression ability, and improve the stability of frequency feature transmission. Therefore, in order to improve the accuracy of gesture recognition as much as possible with less computational overhead and avoid the limitations and shortcomings of existing gesture recognition methods, we adopt the feature pyramid network to perform efficient feature fusion, so as to obtain feature layers with rich details and semantic information in gesture recognition.

3. Our Proposed Pyramid Frequency Feature Fusion Module and Multiscale Attention

The basic idea of the gesture recognition network proposed in this study is to improve the feature pyramid network without changing the basic backbone network. A high-low frequency feature fusion module (HLFFF) and a multiscale attention module (MAM) are used to address the insufficient robustness of the deep network for gesture recognition in complex backgrounds, so as to improve its detection and recognition performance as much as possible. The overall structure of the proposed model is shown in Figure 2. HLFFF fully mines the feature information between the traditional deep feature pyramid layers, so that the fused feature layer contains rich geometric details and powerful semantic information. MAM further distinguishes the importance of different areas and feature channels of the feature layer and emphasizes them by assigning different attention weights, so that the more important information is selected quickly and the most effective information guides the optimization of the model and improves the detection and recognition accuracy.

3.1. High-Low Frequency Feature Fusion Module

The network structure of the high-low frequency feature fusion module (HLFFF) is shown in Figure 3. First, the module uses a pyramid structure to fuse the deep and shallow features in the detection layer of the traditional feature pyramid network to obtain shallow features with rich details and semantic information. Second, in order to obtain deep features that contain rich feature information, the module also includes a high-low frequency feature fusion structure, in which the shallow feature layer is downsampled and fused with the original deep feature layer at a different resolution. The size of the receptive field roughly expresses how much context information is used in the high-frequency feature, and the shallow feature layer loses the context information of the deep feature layer if it is simply downsampled. Therefore, as shown in Figure 3, this study designs a pyramid pooling module (PPM) based on a hierarchical global prior structure to process the shallow feature layer before fusion. After the pyramid pooling module and downsampling, the shallow feature layer contains rich feature information [23, 24]. The pyramid pooling module uses information at different scales to fully mine the prior knowledge of the global context and construct a global prior representation. This structure effectively reduces the loss of context information between regions of different resolution and is an effective global context prior model.

As shown in Figure 2, the high-resolution layer (S1) and the low-resolution layer (S2) are the two feature layers that need to be fused. High-resolution pooling (P-1) and low-resolution pooling (P-2) are the output feature layers of the high-low frequency feature fusion module. First, as in the feature pyramid network, S1 and the upsampled S2 are fused to get the P-1 layer: before feature fusion, S2 is upsampled to S2-up to make its scale consistent with that of S1. Meanwhile, S1 and S2 are also fused to get the P-2 layer. Different from the previous step, S1 is selected to generate the S1-down feature by downsampling. Before downsampling, S1 passes through the pyramid pooling module and is converted into S1-P, from which S1-down is obtained by downsampling. The size of S1-down is consistent with that of S2, and it is finally fused with S2 to form the P-2 layer. The whole high-low frequency feature fusion module can be described by the following equation:

$$(P_1, P_2) = \mathrm{HLFFF}(S_1, S_2), \qquad P_1 = \mathrm{Concat}\big(S_1, \mathrm{Up}(S_2)\big), \qquad P_2 = \mathrm{Concat}\big(\mathrm{Down}(\mathrm{PPM}(S_1)), S_2\big),$$

where Up and Down represent the upsampling and downsampling operations, respectively; HLFFF represents the high-low frequency feature fusion module; PPM represents the pyramid pooling module; and Concat() represents the dot addition operation. In the pyramid pooling module, the initial feature layer first generates three feature maps of different sizes through pooling, and the low-resolution feature maps are then directly upsampled by bilinear interpolation to the same size as the original feature map. Finally, the global pyramid feature layer s is obtained as the cumulative average of the three levels of features. Feature fusion includes three basic operations: downsampling, upsampling, and feature concatenation. Upsampling and downsampling regulate the scale of each layer, while feature concatenation is realized by cumulative average. In addition, a 3 × 3 convolution layer is needed to eliminate the aliasing effect caused by upsampling.
In order to connect the two feature layers, a 1 × 1 convolution layer is used to unify the number of channels. In addition, max-pooling and bilinear interpolation are chosen for downsampling and upsampling, respectively, to avoid high complexity.
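The scale alignment and fusion steps of this subsection can be sketched on toy single-channel feature maps as follows. This is a simplified illustration under stated assumptions: nearest-neighbour upsampling stands in for bilinear interpolation, fusion is plain element-wise addition, and the pyramid pooling module and the 1 × 1 and 3 × 3 convolutions are omitted. All function names are hypothetical.

```python
def max_pool_2x2(fm):
    """Downsample a 2D feature map by a factor of 2 with max-pooling."""
    return [
        [max(fm[i][j], fm[i][j + 1], fm[i + 1][j], fm[i + 1][j + 1])
         for j in range(0, len(fm[0]), 2)]
        for i in range(0, len(fm), 2)
    ]


def upsample_2x(fm):
    """Upsample a 2D feature map by 2 (nearest neighbour, a simplification)."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(a, b):
    """Element-wise addition of two feature maps of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy high-resolution (4 x 4) and low-resolution (2 x 2) layers.
s1 = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
s2 = [[10, 20], [30, 40]]

p1 = fuse(s1, upsample_2x(s2))    # high-resolution output, same size as s1
p2 = fuse(s2, max_pool_2x2(s1))   # low-resolution output, same size as s2
```

The point of the sketch is only that each output keeps the scale of the layer it is fused into, so the detection layers retain their original sizes.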

3.2. Multiscale Attention Module

The network structure and flow chart of the multiscale attention module (MAM) are shown in Figure 4. The multiscale attention module extracts the global dependence between the input sequence and the output sequence through a nonlinear transformation and consists of multiscale dot-product attention (MDA) and a squeeze and excitation block (SEB). Multiscale dot-product attention is an attention mechanism that uses normalized dot products to calculate spatial distribution similarity. The squeeze and excitation block is a network structure that models the correlation between the channels of the convolution feature layer through adaptive learning, so as to select useful feature channels from global information, suppress useless feature channels, and make the extracted feature information more directional. The multiscale attention module in our proposed model combines the advantages of multiscale pyramid fusion and the visual attention mechanism: it mines the correlation information in the input sequences from both the spatial and channel perspectives and then extracts the feature information that is useful for recognizing the image.

It is assumed that $X \in \mathbb{R}^{H \times W \times C}$ is the input feature of the multiscale attention module, $H$ and $W$ are the scales of the input feature layer, $C$ is the number of feature channels, and $x_c$ denotes the $c$th channel of the input feature layer. The squeeze and excitation block includes two mapping functions: squeeze and excitation. We use $F_{sq}$ to stand for the squeeze mapping, which encodes the whole spatial feature on each channel into a global feature. In practice, it is often implemented by global average pooling. First, the input feature is mapped into the global feature space by the squeeze mapping to get the $c$th global feature:

$$z_c = F_{sq}(x_c) = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j).$$

After obtaining the global feature, the squeeze and excitation block adaptively learns the nonlinear relationship between channels by using the excitation mapping. The whole process can be expressed by the following equation:

$$s = F_{ex}(z, W) = \mathrm{Sigmoid}\big(W_2\,\sigma(W_1 z)\big),$$

where $W_1$ and $W_2$ are the weight matrices and $\sigma$ is defined in equation (1). In order to reduce the complexity of the model and improve its generalization ability, we use a bottleneck structure with two fully connected (FC) layers. The first FC layer is the dimension reduction layer, whose parameter is $W_1$, and $\sigma$ is the ReLU activation function. The last FC layer restores the original dimension, and its parameter is $W_2$. The sigmoid activation function is used to get the weight coefficients. Finally, the weight $s_c$ of each channel in the weight vector $s$ is multiplied by the corresponding channel $x_c$ of the original input feature to get the output feature space $\tilde{X}$, where $\tilde{x}_c$ represents the $c$th channel of the output feature space. Thus, we can get $\tilde{x}_c = s_c \cdot x_c$.
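The squeeze, excitation, and channel reweighting steps can be sketched as below. This is a minimal pure-Python illustration that assumes global average pooling for the squeeze step and a two-layer ReLU/sigmoid bottleneck for the excitation step, as the text describes; the function names and toy weights are assumptions for illustration only.

```python
import math


def squeeze(x):
    """Global average pooling: one scalar per channel; x is indexed x[c][i][j]."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def excite(z, W1, W2):
    """Bottleneck of two FC layers: ReLU after W1, sigmoid after W2."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in W1]
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in W2]


def se_block(x, W1, W2):
    """Scale every channel of x by its learned weight in (0, 1)."""
    s = excite(squeeze(x), W1, W2)
    return [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, x)]


# Two 2 x 2 channels and toy weights (assumed values; reduction to one hidden unit).
x = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
W1 = [[1.0, 1.0]]        # dimension reduction: 2 channels -> 1 hidden unit
W2 = [[0.0], [1.0]]      # dimension restoration: 1 hidden unit -> 2 channels
out = se_block(x, W1, W2)
```

With these toy weights the first channel receives the weight sigmoid(0) = 0.5, while the second receives a weight close to 1, which shows how one channel can be suppressed relative to another.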

The feature space is transformed into three different spaces, $Q$, $K$, and $V$, and the attention score matrix is obtained by multiplying $Q$ and $K$. Finally, the multiscale attention module models the feature from two different angles, feature channel and space, to get the key feature information of each, and adds the two results to get the output of the whole multiscale attention module. The spatial output feature links the long-range dependency of all positions in the input feature and thereby mines the global context information of the feature map. It highlights the relevant parts of the feature map and uses the refined information to guide the detection and recognition task in human-computer interaction. The channel output feature captures the interdependence between the channels through adaptive learning, strengthens the useful feature information on the important channels, and weakens the useless feature information on the unimportant channels, which is very suitable for gesture recognition in complex environments.
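The spatial branch can be illustrated with a plain dot-product attention sketch. It assumes that the three transformed spaces play the usual query, key, and value roles (called Q, K, and V here, which is an assumption) and that the normalization is a row-wise softmax; the multiscale aspect and the learned projections are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(Q, K, V):
    """Return (output, scores): scores = softmax(Q K^T), output = scores V."""
    scores = [softmax([sum(q * k for q, k in zip(qi, kj)) for kj in K])
              for qi in Q]
    output = [[sum(a * v[d] for a, v in zip(row, V)) for d in range(len(V[0]))]
              for row in scores]
    return output, scores


# Two positions with 2-dimensional features (toy values).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, scores = dot_product_attention(Q, K, V)
```

Each row of `scores` sums to 1, so every output position is a weighted average of all positions in V, which is how the attention output can link dependencies across the whole feature map.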

3.3. Loss Function

The main reason why the recognition accuracy of the two-step detection algorithm is higher than that of the one-step detection algorithm is that, during the training of the one-step detection algorithm, the proportion of positive and negative samples in the image dataset is seriously unbalanced, and a large number of simple negative samples dominate the network optimization process, which leads to a lower recall rate. To solve this problem, focal loss has been proposed to reduce the proportion of simple negative samples in the loss function [25]. The specific training process is as follows:

In the training process, the total objective loss function is the weighted sum of the loss used for classification and the loss used for regression. The total loss function is therefore written as follows:

$$L(x, c, l, g) = \frac{1}{N}\big(L_{cls}(x, c) + \alpha\, L_{loc}(x, l, g)\big),$$

where $N$ is the number of positive anchor boxes; $x_{ij}^{p} = 1$ indicates that the $i$th anchor box matches the $j$th real box with category $p$, and otherwise $x_{ij}^{p} = 0$; $c$ is the predicted value of category confidence; $l$ is the position prediction information of positive samples; $g$ is the position parameter of the true bounding box; and $\alpha$ is the weight coefficient. In summary, this study focuses on applying deep convolutional neural network theory to gesture recognition so as to achieve fast, accurate, and robust gesture detection and recognition and to improve the accuracy of human-computer interaction.
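To make the effect of focal loss on simple negative samples concrete, the sketch below implements the commonly used binary form, FL = -alpha_t (1 - p_t)^gamma log(p_t). The default values alpha = 0.25 and gamma = 2 are assumptions taken from common practice, since the text does not state the settings used in this study.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample.

    p is the predicted probability of the positive class, y is 1 or 0.
    alpha and gamma are assumed defaults, not values given in the paper.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The factor (1 - p_t) ** gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = focal_loss(0.1, 0)   # background predicted confidently
hard_negative = focal_loss(0.9, 0)   # background mistaken for a gesture
```

With gamma set to 0 the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy, so gamma controls how strongly easy samples are down-weighted.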

4. Experimental Results and Analysis

4.1. Dataset

Since gesture recognition is a supervised classification task, the training data must provide gesture images together with their corresponding labels. In order to prove the effectiveness of our proposed method in human-computer interaction, this study verifies it on three gesture recognition datasets. An overview of the characteristics of the three datasets is given in Table 1, and each dataset is introduced in this section.

The ChaLearn organization was established to promote machine learning research and to attract university teams and researchers by organizing various challenges [26]. In some years, it organized a series of challenges related to human behavior recognition. The ChaLearn 2013 data were collected with Kinect. The participating teams were asked to use the data to construct a multimodal classifier, and each team was free to choose its tools and resources. The results of the 9 teams with the best scores were selected and announced. The frame rate of the video is 20 fps, and the resolution is 640 × 480 pixels. The database is divided into three parts, a training set, a validation set, and a test set, which contain 6850, 3454, and 3579 samples, respectively. Each sample contains the following four types of data: RGB, depth, mask, and skeleton.

EgoGesture is a multimodal large-scale public dynamic gesture recognition dataset, which contains 83 gestures [27]. This dataset provides the test-bed not only for gesture classification in segmented data but also for gesture detection in continuous data. In addition, we also collected a large number of character images as self-built dataset.

4.2. Experiment Configuration and Parameter Setting
4.2.1. Network Structure

Consistent with the traditional deep gesture recognition algorithm, the proposed algorithm selects the VGG-16 network pretrained on the ImageNet dataset as the basic backbone network to extract the detection layers, and the last two fully connected layers Fc6 and Fc7 are converted into the convolutional layers Conv6 and Conv7, respectively. As shown in Figure 2, the feature pyramid layers Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2 are denoted as S1, S2, S3, S4, S5, and S6, respectively. Since the feature fusion structure does not perform well in high-level semantic feature fusion, our improved model applies the high-low frequency feature fusion module (HLFFF) only to the first four layers (S1, S2, S3, and S4): S1 and S2 form one pair of feature layers, from which F1 and F2 are generated through HLFFF; S3 and S4 form another pair, from which F3 and F4 are generated through HLFFF; and S5 and S6 remain unchanged and are denoted as F5 and F6. Then, the multiscale attention module (MAM) is applied to F1, F2, F3, F4, F5, and F6 to generate P1, P2, P3, P4, P5, and P6, which form the detection layers of the proposed deep gesture recognition model. The detection layers of the proposed algorithm are consistent with those of the traditional deep learning algorithm in terms of scale and number of feature channels.

4.2.2. Training Details

Two input image scales are used for the improved algorithm: 300 × 300 and 512 × 512. The batch gradient descent algorithm (BGD) is used for optimization. The batch size is set to 32, and the maximum number of iterations is 120 k. A warmup strategy is used to set the learning rate, which helps to mitigate overfitting to the mini-batches in the initial stage and keeps the distribution and the deep layers of the model stable. The learning rate for the first 1 k iterations is $10^{-4}$. From 1 k to 80 k iterations, the learning rate is increased to $10^{-3}$; from 80 k to 100 k iterations, it is reduced to $10^{-4}$; and finally, from 100 k to 120 k iterations, it is reduced to $10^{-5}$. In order to ensure the fairness of the experiment, we first trained and evaluated the traditional comparison algorithms and then performed the corresponding experiments on the proposed algorithm.
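The learning-rate schedule described above can be written as a small piecewise-constant function. The values come from the text; how the exact boundary iterations (1 k, 80 k, 100 k) are assigned is an assumption, as is the function name.

```python
def learning_rate(iteration):
    """Piecewise-constant schedule with a warmup phase.

    Boundary iterations are assigned to the later phase, which is an
    assumption; the text only gives the ranges.
    """
    if iteration < 1000:       # warmup
        return 1e-4
    if iteration < 80000:      # main training phase
        return 1e-3
    if iteration < 100000:     # first decay
        return 1e-4
    return 1e-5                # final decay, up to 120 k iterations


schedule = [learning_rate(i) for i in (0, 999, 1000, 79999, 80000, 100000, 119999)]
```

A framework-level scheduler would normally replace this function in practice; the sketch only makes the four phases explicit.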

4.2.3. Experimental Environment

All experiments in our study are performed on the same hardware, namely, an Nvidia GTX 1080 Ti GPU and an Intel i7-7800X 3.50 GHz CPU. The software environment is Python 3.5 with TensorFlow 1.4 as the deep learning framework, running on Ubuntu 16.04 with CUDA 8.0.

4.3. Evaluation Metrics

It is well known that recognition accuracy, F1, and recall are commonly used as criteria to evaluate model performance in image classification tasks [28]. The F1 score is a statistical index of the accuracy of a classification model that is suited to unbalanced data. It takes into account both the precision and the recall of the classification model and can be regarded as their harmonic mean. Its maximum value is 1 and its minimum value is 0. However, the plain F1 score is not suitable for multicategory tasks [29]. Therefore, we use the MacroF1 and MicroF1 scores to describe precision and recall. The MacroF1 score is calculated by averaging the F1 of all categories, while the MicroF1 score is calculated from the precision and recall over all instances. The MacroF1 score is defined as follows:

$$P_k = \frac{TP_k}{TP_k + FP_k}, \qquad R_k = \frac{TP_k}{TP_k + FN_k}, \qquad \mathrm{MacroF1} = \frac{1}{K} \sum_{k=1}^{K} \frac{2 P_k R_k}{P_k + R_k},$$

where $K$ is the number of classes and $FP_k$, $FN_k$, $TN_k$, and $TP_k$ are, respectively, the number of gestures classified incorrectly into class $k$, the number of gestures belonging to class $k$ classified incorrectly into other classes, the number classified correctly into other classes excluding $k$, and the number classified correctly into $k$. The MicroF1 score is defined similarly.
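The difference between the two averages can be checked with a short computation. The sketch below uses the standard definitions: MacroF1 averages the per-class F1 scores, and MicroF1 pools the counts over all classes before computing a single F1. The per-class counts are made-up illustrative values, not results from this study.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; equals 2PR / (P + R) and is 0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """counts is a list of (tp, fp, fn) tuples, one per class."""
    macro = sum(f1(*c) for c in counts) / len(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1(tp, fp, fn)
    return macro, micro


# Hypothetical counts for a frequent class and a rare class.
counts = [(8, 2, 0), (1, 0, 2)]
macro, micro = macro_micro_f1(counts)
```

In this toy case the rare class is recognized poorly, so MacroF1 comes out lower than MicroF1; this is why the two scores are reported together on unbalanced data.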

4.4. Ablation Analysis

To verify the effectiveness of the high-low frequency feature fusion module (HLFFF) and the multiscale attention module (MAM) proposed in this study, a series of ablation analyses was performed on the ChaLearn 2013 dataset, as given in Table 2. To ensure the fairness of the experiment, the traditional deep gesture recognition algorithm (TDGR) was trained and evaluated first. For a 300 × 300 input image, the proposed algorithm has a mAP of 79.1%, which is 1.6% higher than that of TDGR; with the increased parameter amount, the FPS decreases slightly to 39.6. For a 512 × 512 input, the proposed algorithm has a mAP of 81.0%, which is 1.2% higher than that of TDGR; the parameter amount increases by 24.57 M, so the FPS also drops slightly to 20.8. Further ablation experiments showed that using HLFFF alone to improve TDGR gave a mAP of 78.8% and an FPS of 47.1 with a parameter increase of 14.04 M, and using MAM alone to improve TDGR gave a mAP of 78.6% and an FPS of 48.1 with a parameter increase of 10.53 M. Using the individual components of MAM alone, including the SEB network, gave mAPs of 78.2% and 78.4%, FPS of 53.1 and 51.5, and parameter increases of 6.24 M and 4.29 M, respectively. These results show that the HLFFF and MAM modules proposed in this study are effective, and that MAM is clearly more effective than SEB because it combines the advantages of the two components and mines key information in both the spatial and channel directions at the same time. In terms of parameter quantity, HLFFF is more complicated than MAM and adds relatively more parameters.
Since MAM module combines SEB, its increased parameter quantity is the sum of the two increased parameter quantities. On the whole, all the parameters added by the proposed algorithm are derived from HLFFF and MAM. Compared with the use of complex basic backbone networks and complex feature fusion methods, the TDGR algorithm in this study has increased a small amount of computational overhead, while achieving greater performance improvement is very effective.
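The parameter accounting in this paragraph can be checked with a short calculation. The sketch below only re-adds the increases reported in the text; the dictionary keys, including the "SEB (a)" and "SEB (b)" labels for the two SEB configurations, are hypothetical names introduced for illustration, and it assumes the reported increases are directly comparable.

```python
import math

# Parameter increases over the baseline, in millions (M), as reported in the
# ablation analysis. The "SEB (a)" / "SEB (b)" labels are hypothetical names.
added_params_m = {
    "HLFFF": 14.04,
    "MAM": 10.53,
    "SEB (a)": 6.24,
    "SEB (b)": 4.29,
    "HLFFF + MAM (proposed)": 24.57,
}


def sums_match(parts, total, tol=1e-6):
    """Check that the increases of `parts` add up to the increase of `total`."""
    combined = sum(added_params_m[name] for name in parts)
    return math.isclose(combined, added_params_m[total], abs_tol=tol)


# MAM's increase equals the sum of the two SEB increases.
seb_matches_mam = sums_match(["SEB (a)", "SEB (b)"], "MAM")
# The full model's increase equals the sum of the HLFFF and MAM increases.
modules_match_total = sums_match(["HLFFF", "MAM"], "HLFFF + MAM (proposed)")

print(seb_matches_mam, modules_match_total)  # True True
```

Both checks hold (6.24 M + 4.29 M = 10.53 M and 14.04 M + 10.53 M = 24.57 M), which is consistent with the statement that all added parameters come from HLFFF and MAM.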

4.5. Qualitative and Quantitative Comparison

Many gesture recognition algorithms for natural images have been proposed recently, but they perform gesture recognition on hands that have already been detected. The purpose of our proposed model is to integrate hand detection and gesture recognition so as to improve the precision of gesture recognition and reduce the influence of background interference, since detection algorithms based on deep learning have achieved great results in the field of natural images. To further verify and evaluate the gesture recognition algorithm in this experiment, it is compared with several other advanced gesture recognition algorithms, namely GestNet [16], GMRCNN [17], Gst_SSD [18], Dense_Ges [19], and FMD + DTW [20]. All selected comparison algorithms are run using the source code provided by their authors. Although these deep models are not designed specifically for gesture recognition, they are not tied to particular object categories; as long as sufficient training data are given, the corresponding detection results can be obtained. All comparison models are trained and evaluated on the same training set and test set to ensure a fair comparison. The quantitative comparison results for recognition accuracy are given in Table 3.

Gst_SSD is based on the multiscale convolutional detection algorithm SSD and introduces feature fusion across different convolutional layers. After a dilated-convolution downsampling operation and a deconvolution upsampling operation, the shallow visual convolutional layer and the deep semantic convolutional layer are fused in the network structure, and the fused layer replaces the original convolutional layer for gesture recognition, which improves the recognition accuracy. However, the Gst_SSD model cannot recognize small-scale gestures. GMRCNN feeds the gesture matrix into the neural network as input for pattern recognition, uses a fully connected network structure and parameter selection to prevent overfitting, and then combines fingertip detection and elimination algorithms to improve recognition accuracy. It is worth mentioning that the accuracy of the GMRCNN algorithm depends entirely on the gesture matrix. GestNet uses a PNN for classification and applies transfer learning to transfer a deep learning model to the constructed model. The verification experiment conducted on the public Keck Gesture dataset shows that the deep network has certain advantages for gesture recognition, but it still has shortcomings and is not suitable for engineering applications. Dense_Ges obtains the gesture anchors of the training set through K-means dimensional clustering [30], which are responsible for detecting gestures of different scales; it then uses transfer learning and fine-tuning to train the gesture recognition model. Its average recognition rate for gestures with large interference in complex backgrounds is only 49.2%. FMD + DTW combines DTW and deep learning for gesture recognition. By analyzing the extracted skeleton information, it realizes real-time coarse-grained recognition of continuous gestures; the ROI is then extracted through joint point calculation, features are extracted from the ROI using a 7-layer CNN, and fine-grained recognition is obtained.
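For readers unfamiliar with DTW (dynamic time warping), the sequence-matching step that FMD + DTW relies on can be sketched as follows. This is the textbook DTW recurrence on one-dimensional sequences, not the implementation used in [20]; the function name and the toy sequences are hypothetical, and a skeleton-based system would replace the absolute difference with a distance between joint-coordinate vectors.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance between two 1-D sequences.

    cost[i][j] holds the minimal accumulated cost of aligning the first i
    elements of `a` with the first j elements of `b`.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # local distance (assumed L1)
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched again (stretch b)
                cost[i][j - 1],      # b[j-1] is matched again (stretch a)
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


# The same trajectory performed at a different speed aligns with zero cost.
template = [1.0, 2.0, 3.0]
slower = [1.0, 2.0, 2.0, 3.0]
print(dtw_distance(template, slower))  # 0.0
```

Because the warping path may repeat elements of either sequence, gestures performed at different speeds can still match a stored template, which is what makes DTW suitable for coarse-grained recognition of continuous gestures.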

5. Conclusion

To improve the accuracy of gesture recognition as much as possible with less computational overhead and to avoid the limitations and shortcomings of existing gesture recognition methods, this study proposes a multiscale feature fusion module based on the feature pyramid network. The module performs efficient feature fusion to obtain feature layers with rich details and semantic information, which helps to improve the efficiency and accuracy of gesture recognition. In addition, this study adopts a multiscale attention module to adaptively mine important and effective feature information from both temporal and spatial channels and embeds it in the detection layer. As a result, the proposed network enhances the effective information and suppresses the invalid information of the detection layer. Experimental results show that our proposed model makes full use of the high-low frequency feature fusion module without replacing the basic backbone network, which can greatly reduce the computational overhead while improving the detection accuracy.

Data Availability

The labeled datasets used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.


Acknowledgments

This work was supported by Jiangxi University of Technology.