Abstract

Current human motion target gesture recognition algorithms suffer from low classification accuracy, low overlap ratio, low recognition precision and recall, and long recognition time. To address these problems, a human motion gesture recognition algorithm based on a deep neural network was proposed. First, Kinect interface equipment was used to collect the coordinate information of human skeleton joints, the features of motion gesture nodes were extracted, and the overall structure of the key node network was constructed with a deep neural network. Second, a local recognition region was introduced to generate a high-dimensional feature map, a sampling kernel function was defined, and the minimum space-time domain of the node structure map was located by sampling in the space-time domain. Finally, a deep neural network classifier was constructed to fuse and classify the gesture data features of the human motion target and realize recognition of the human motion target. The results show that the proposed algorithm achieves high classification accuracy and overlap ratio: the recognition precision reaches 93%, the recall rate reaches 88%, and the recognition time is 17.8 s, which effectively improves the recognition of human motion target gestures.

1. Introduction

Humans perceive information from the outside world mainly through vision. With the advancement of science and technology, computer technology and related equipment are increasingly applied to perceive and understand human behavior and actions, producing a wealth of image and video information [1]. Therefore, how to combine algorithms and logical operations so that computers acquire human visual functions and can perform analysis is a research hotspot in computer simulation and application technology [2]. Among these topics, gesture recognition of human motion targets is an important research direction in computer vision. Human motion gesture recognition technology analyzes information about human motion behavior and judges the state of that behavior. Because it can provide information on a user's exercise status, it is widely used in sports health, user social behavior analysis, indoor positioning, and other fields.

A deep neural network is a fully connected neuron structure with multiple hidden layers. As a representative technology of deep learning, deep neural networks have achieved notable results in visual research fields such as human behavior recognition. In literature [3], a human gesture recognition algorithm based on a CNN is proposed. An 11-layer convolutional neural network model is constructed, five human gestures are collected in a human gesture data set, and the gestures are processed by convolution and pooling operations. The fully connected layer of the network classifies the gesture data set, which is then trained and tested to realize human gesture recognition. The algorithm extracts gesture features efficiently, but its recognition recall rate is low. In literature [4], a method based on a fast open attitude model is proposed for detecting the gestures of astronauts working in the weightless environment of a space capsule. A fast open attitude model is constructed, and a lightweight deep neural network is used to extract the gesture features of astronauts in the weightless environment. To ensure recognition accuracy, three small convolution kernels are used to build the fast attitude detector; the branch structure is changed through parameter sharing in the convolution process; a residual network is used to suppress the risk of vanishing gradients; and astronaut work gesture detection is realized. The method detects the astronauts' operating gestures efficiently, but its detection accuracy is low. In literature [5], a deep neural network based on a contextual long short-term memory (LSTM) architecture is proposed, which uses both content and metadata to detect social bots: features extracted from user metadata serve as an auxiliary input when the contextual LSTM network processes the tweet text. However, the feature extraction effect of this method is poor. In literature [6], a new method is proposed for training deep neural networks to synthesize dynamic motion primitives. It uses a new loss function that measures the physical distance between motion trajectories rather than between parameters that have no physical meaning. The authors evaluate the method and show that minimizing this loss function yields better results than more traditional loss functions, but the recognition time is long.

In response to the above problems, this paper proposes a human motion target gesture recognition algorithm based on a deep neural network. The idea is as follows:

(1) According to the static gesture of the human, the distance between the key nodes is calculated. The Kinect interface equipment is used to collect the coordinate information of the human bone joints, the differences in the feature values of the human motion gestures are calculated, and the node features of the motion gesture are extracted. The deep neural network is used to build the overall structure of the key node network and narrow the node positioning range.

(2) The local recognition region is introduced to generate a high-dimensional feature map, and the sampling kernel function is defined to determine the neighborhood of the central pixel.

(3) A deep neural network classifier is constructed, its weighted value is obtained, the gesture features of the human motion target are calibrated, and the gesture data features are fused and classified to obtain the gesture recognition result of the human motion target.

2. Related Work

At present, many scholars at home and abroad have carried out extensive research on the gesture recognition of human motion targets and achieved notable results. In literature [7], a depth- and view-invariant human action recognition framework is proposed, which encapsulates the motion content of an action as an RGB dynamic image generated by approximate rank pooling and processed with a fine-tuned model. Long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM) learning models are used to learn the long-term view-invariant shape dynamics of the action, and view-invariant features of key deep human gesture frames are generated based on a structural similarity index matrix. The algorithm has a short recognition time, but it is affected by complex changes in the positions of key nodes, resulting in lower accuracy of human action recognition. Literature [8] proposed learning human gesture models from synthetic data for robust RGB-D action recognition. By analyzing the human gestures in a large amount of human skeleton data, 3D human bodies of different shapes are synthesized and each gesture is rendered from 180 camera viewpoints, while clothing texture, background, and lighting are randomly varied; a generative adversarial network is used to minimize the gap between the synthetic and real image distributions. A CNN model is learned to transfer shared human gestures, an invariant feature extractor is constructed, a pyramid models temporal changes, and linear support vector machines perform the classification. This algorithm performs well in RGB and RGB-D action recognition, but its recognition time is long. In literature [9], a new ellipse distribution coding method is proposed to understand human gesture behavior under infrared imaging. First, elliptical Gaussian coordinate coding is used to compute the relationship between adjacent joint points; then the prediction between the infrared image and the real image is measured; finally, infrared human gesture image recognition is completed. However, the algorithm takes a long time to recognize.

There are also many studies in China. In literature [10], a human gesture recognition method based on a small number of key sequence frames is proposed. The original motion sequence is preselected, the extreme values of the motion trajectory are used to construct the primary key frame sequence, and a frame reduction algorithm yields the final key frame sequence. According to different human gestures, a hidden Markov model is constructed, the Baum–Welch algorithm is used to train the model, and the forward algorithm is used to recognize the human gesture. The method's recognition accuracy is relatively high, but it involves a large amount of calculation and therefore a long recognition time. In literature [11], an optimized multiperson gesture detection algorithm based on reinforcement learning is proposed. The SSD algorithm is used to construct a target detector that obtains the initial bounding box of the human body, which is set as an agent. Reinforcement learning combines a Markov decision process and a Q-network to build a target refinement model that trains the agent and iteratively adjusts its nine actions and four directions, and a stacked hourglass algorithm builds a gesture detector that detects the gesture within the adjusted bounding box. The human body detection accuracy of this algorithm is high, but its gesture detection recall rate is low. In literature [12], a human skeleton behavior recognition method based on temporally and spatially weighted gesture motion features is proposed. A bilinear classifier is iterated to obtain the action weights of the joint points and static gesture categories and to determine the joint points and gestures. Dynamic time warping and the Fourier temporal pyramid algorithm are used to construct a long time-sequence model of human skeleton behavior, and support vector machines classify the behaviors to realize recognition. The algorithm has a good recognition effect, but its recognition time is long.

For this reason, this paper proposes a human motion target gesture recognition algorithm based on a deep neural network and uses the MSCOCO and MPII data sets as data sources to test the proposed algorithm, verifying its superiority.

3. Algorithm for Target Human Motion Gesture Recognition Based on Deep Neural Network

3.1. Extract Motion Gesture Node Features

According to the static gesture of the human, the distance between the key nodes is calculated to extract the gesture features of the human motion target. From a physiological point of view, there are 20 human bone joints in total, which are used as the key nodes of the human motion target gesture [13]. By observing human movement, the correlations among bones and joints are obtained, and the key nodes of different motion gestures are selected in a targeted manner. Because bone sizes differ between individuals, the motion gesture of a specific individual is represented by the distance information of its key nodes. For different sports, the movement distances of the head, arms, and legs are not fixed; for different gestures, the distances between joint points change with the movement [14, 15]. To eliminate the influence of changing joint point spacing on the human gesture, the joint point spacing is set to a fixed distance value, and this value represents the static gesture features of the human. First, the features of the static gesture are extracted: the human skeleton in the static state is selected, the head joint is used as the reference node, the distance from every other joint to the head joint is calculated, and these distances form the elements of the feature matrix [16, 17]. The static gesture features can also characterize the motion gesture. Building on the static feature extraction, the motion gesture features are further extracted. The motion gesture objects are the human bones at different moments; frames are extracted equidistantly for each motion gesture, and the displacement of each joint point between frames $i$ and $j$ is calculated as

$$d_k^{ij} = \sqrt{\left(x_k^j - x_k^i\right)^2 + \left(y_k^j - y_k^i\right)^2 + \left(z_k^j - z_k^i\right)^2},$$

where $\left(x_k^i, y_k^i, z_k^i\right)$ is the position coordinate of node $k$ in the image of frame $i$ and $\left(x_k^j, y_k^j, z_k^j\right)$ is the position coordinate of the same node in the image of frame $j$. $N$ frames are selected, the $M$ nodes in each frame are extracted, and the displacement of each node is obtained from its coordinates. The feature matrix characterizing the movement gesture through the coordinate distances is

$$F = \left[d_k^{ij}\right]_{k = 1, \ldots, M},$$

where $d_k^{ij}$ is the distance of node $k$ between frame $i$ and frame $j$. This feature matrix is the feature vector of the motion gesture. The key information required for feature extraction is the coordinate position information of the key nodes of the human skeleton. Considering that the coordinate information needs to be relatively stable and highly accurate, this study uses the Kinect interface device to collect the coordinate information of the human bone joints. According to the above process, the differences in the feature values of different human motion gestures are calculated, and the extraction of the motion gesture features is completed.
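The feature extraction above reduces to simple distance computations over joint coordinates. The following is a minimal sketch, assuming joints arrive as (x, y, z) arrays per frame; the head-joint index, array shapes, and frame counts are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

HEAD = 3  # hypothetical index of the head joint in a 20-joint Kinect skeleton

def static_features(skeleton):
    """Distance from every joint to the head joint (static gesture feature).

    skeleton: (20, 3) array of (x, y, z) joint coordinates for one frame.
    Returns a (20,) vector; the head's own entry is 0.
    """
    return np.linalg.norm(skeleton - skeleton[HEAD], axis=1)

def motion_features(frames, i, j):
    """Per-joint displacement d_k^{ij} between frames i and j.

    frames: (T, 20, 3) array of joint coordinates over T frames.
    """
    return np.linalg.norm(frames[j] - frames[i], axis=1)

# Feature matrix F from equidistantly sampled frames of one motion clip
frames = np.random.rand(30, 20, 3)                   # stand-in for Kinect capture
idx = np.linspace(0, len(frames) - 1, 6).astype(int)
F = np.stack([motion_features(frames, a, b) for a, b in zip(idx[:-1], idx[1:])])
print(static_features(frames[0]).shape, F.shape)     # (20,) (5, 20)
```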

3.2. Construction of the Node Structure Diagram

Since the structure of the key nodes of the human gesture is a graph structure, it needs to be processed effectively. A deep neural network is used to construct a graph of the motion gesture node structure, and the key nodes are learned to realize regional positioning. The purpose of constructing the overall network structure is to learn the position of the node graph in the corresponding input image; each position is divided into different regions so as to narrow the node positioning range. The narrowed positioning range is a regional component, and a given key point corresponds to a region category, so a key trajectory is established, expressed as a multilayer bone sequence, with each node corresponding to a specific multilayer bone sequence [18, 19]. According to the characteristics of the network, hierarchical representation and localized distinction are carried out, and sequence classification is realized through this conversion. Each graph of the original data has a corresponding node graph. In the key point coordinate data, the data are a series of frames, and each frame contains the joint coordinates of the nodes. The node vectors of two frames are constructed into a spatial structure graph [20, 21]. This spatial structure can be used as the image input of the neural network; the adjacent grid becomes a specific area of image pixels, and a high-dimensional feature map is obtained after convolution processing. Classification is performed on the feature map to obtain the corresponding coordinate positions [22]. The coordinates of key nodes undergo affine transformation and can be rotated within a certain range of angles to construct gestures of different spatial structures, as shown in Figure 1.

After the key nodes of a single frame of the human gesture have been rotated, multiframe states are formed and combined into the graph data, which are used as the input of the deep neural network. A key node matrix based on the graph data is established as the collection of all nodes in the gesture structure graph:

$$V = \left\{ v_{ti} = \left(p_{ti}, c_{ti}\right) \mid t = 1, \ldots, T;\ i = 1, \ldots, N \right\},$$

where $V$ is the graph structure gesture matrix over $T$ frames and $N$ nodes; $p_{ti}$ is the position function, whose coordinate sets are $X = \left\{x_{ti}\right\}$ and $Y = \left\{y_{ti}\right\}$; and $c_{ti}$ is the node coordinate confidence, whose set is $C = \left\{c_{ti}\right\}$. The confidence is used to judge whether a key node exists. Through the above conversion process, the overall structure of the key node network is constructed, providing a graph structure basis for subsequent network training.
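As a concrete illustration, the node matrix can be assembled by stacking per-frame keypoint coordinates with their confidences. This is a minimal sketch under the assumption of 2D keypoints; the 0.1 confidence threshold is an illustrative choice, not a value given in the paper.

```python
import numpy as np

def build_node_matrix(keypoints, scores):
    """Stack per-frame joint coordinates and confidences into graph input.

    keypoints: (T, N, 2) array of (x, y) joint positions over T frames.
    scores:    (T, N) array of per-joint confidence values.
    Returns a (T, N, 3) tensor V with V[t, i] = (x, y, c), the form fed
    to the graph-based deep network.
    """
    V = np.concatenate([keypoints, scores[..., None]], axis=-1)
    V[scores < 0.1] = 0.0   # drop key nodes judged not to exist (low confidence)
    return V

V = build_node_matrix(np.random.rand(30, 18, 2), np.random.rand(30, 18))
print(V.shape)  # (30, 18, 3)
```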

3.3. Locate the Local Recognition Area of the Node Structure

In the gesture structure, the position changes of the key nodes of the human are complicated. Compared with the number of pictures, the feature map constructed from the key nodes can express only limited information. To accurately recognize the overall motion target gesture, a large number of training coordinate positions and classification label values would need to be calculated, which increases both the difficulty of recognition and the amount of calculation [23]. Therefore, in this research, the overall positioning of key points is transformed into the positioning of a local recognition area to improve the computational efficiency of the gesture recognition algorithm. The target detection task mainly finds the target in an image and unifies the detection steps in the deep network: the original-scale feature map is input and propagated forward through the shared convolutional layers to generate a higher-dimensional feature map. Classification and position regression are performed on multiple feature scales simultaneously; with each pixel of the feature map as a center, a default box is generated and mapped from the feature map to the original image location according to the center point coordinates. The sampling kernel function is defined, and the neighborhood of the center pixel is determined [24, 25]. In the single-frame image structure, a grid is formed from the center position outward so that the neighborhood pixels have a fixed spatial order, and retrieval is performed by dimension. Because the spatial structure graph is irregular, a fixed node must be set as the root node to mark the neighborhood set and realize the weight function, which is iteratively updated through the network's preset hyperparameters. The sampling function and weight function are converted into network form:

$$f_{\mathrm{out}}\left(v_{ti}\right) = \sum_{v_{tj} \in B\left(v_{ti}\right)} \frac{1}{Z_{ti}\left(v_{tj}\right)}\, f_{\mathrm{in}}\left(p\left(v_{ti}, v_{tj}\right)\right) w\left(v_{ti}, v_{tj}\right),$$

where $f_{\mathrm{out}}\left(v_{ti}\right)$ is the network form of node $v_{ti}$, which belongs to the neighboring domain $B\left(v_{ti}\right)$. $B\left(v_{ti}\right)$ is divided into subsets, and each subset corresponds to a weight value; $v_{tj}$ is a node of the neighboring domain; $p$ refers to the sampling function; $w$ refers to the weight function; and $Z_{ti}\left(v_{tj}\right)$ is a regular term equal to the cardinality of the node neighborhood subset. The contribution of different subsets to the neural network is analyzed through the label map:

$$w\left(v_{ti}, v_{tj}\right) = w'\left(l_{ti}\left(v_{tj}\right)\right),$$

where $l_{ti}\left(v_{tj}\right)$ is the subset label of a specific point. This formula is similar to the standard network: it is applied to a single frame of image and changes the pixel scale, and each subset corresponds to a single node. After the neural network is sampled, the state of each single node is learned to obtain the graph structure features of the key nodes of the human; after classification, the areas where different key nodes are located are distinguished. Further, the minimum spatio-temporal neighborhood is divided in the multisequence space. The minimum time domain represents the minimum set of nodes across different frames, and the minimum space domain represents the minimum set of nodes within a single frame. Since the single-frame connection structure alone cannot describe the temporal neighborhood, the bone sequence is extended to the space-time domain:

$$B\left(v_{ti}\right) = \left\{ v_{qj} \,\middle|\, d\left(v_{tj}, v_{ti}\right) \le K,\ \left|q - t\right| \le \left\lfloor \Gamma / 2 \right\rfloor \right\},$$

where $B\left(v_{ti}\right)$ is the minimum time and space domain of node $v_{ti}$; $d\left(v_{tj}, v_{ti}\right)$ is the sampling distance from node $v_{tj}$ to $v_{ti}$; $K$ is the number of sampling frameworks, and $\Gamma$ is the time domain length. Through space-time sampling, the smallest space-time feature is detected, completing the location of the local recognition area of the feature map.
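The spatio-temporal neighborhood above can be enumerated directly from the skeleton graph. Below is a minimal sketch, assuming a hop-distance matrix over the skeleton joints; the defaults for K and Gamma are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def st_neighborhood(t, i, dist, num_frames, K=1, Gamma=9):
    """Enumerate the minimum space-time domain B(v_ti) of node i at frame t.

    dist:       (N, N) matrix of hop distances d(v_tj, v_ti) on the skeleton.
    num_frames: total frame count, used to clip the temporal window.
    K:          maximum sampling distance in the spatial graph.
    Gamma:      time-domain length (temporal window).
    Returns a list of (frame, node) index pairs inside the neighborhood.
    """
    lo, hi = max(0, t - Gamma // 2), min(num_frames - 1, t + Gamma // 2)
    return [(q, j)
            for q in range(lo, hi + 1)
            for j in range(dist.shape[0])
            if dist[i, j] <= K]

# Toy 3-joint chain 0-1-2: node 1 reaches all joints within one hop
dist = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]])
print(len(st_neighborhood(t=5, i=1, dist=dist, num_frames=30)))  # 27 pairs
```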

4. Establishing the Algorithm for Human Motion Target Gesture Recognition

The gesture recognition algorithm is based on a large amount of data and establishes the relationship between the target output and the actual output by constructing an activation function. By continuously adjusting the weights and variation parameters, the optimal solution is obtained to realize the recognition of the human motion target gesture. The algorithm is described as follows (a code sketch of the classifier head is given after the step list).

(i)Input: the original data of human motion, as well as the positioning results in the minimum space-time domain of the motion gesture node structure diagram.
(ii)Output: recognition results of the human motion target gesture.
(iii)According to the test samples in the gesture database of the human motion target to be recognized and the sample training set, obtain the human gesture feature distribution set.
(iv)Here, the set size is determined by the number of human motion target gestures in the training set.
(v)Construct a deep neural network classifier and obtain its weighted value.
(vi)The weighted value is determined by the initialized eigenvalues and the binarized fitting results. Through the feature extraction results of the motion gesture nodes, the deep neural network is introduced to obtain the input-output iterative equation of the classifier.
(vii)The iteration is governed by the learning step length and the maximum number of training iterations. Using the structural similarity algorithm, the weighted coefficient of the human motion target gesture classifier is obtained.
(viii)Through the deep neural network classifier, the gesture features of the human motion target are calibrated, and the recognition statistics are obtained.
(ix)According to the classification method of the video image, the data are fused, classified, and recognized. The image pixels after feature extraction are traversed by sliding a window over the entire image:

$$y_{mn} = \sum_{u}\sum_{v} x_{m+u,\,n+v}\, k_{uv} + e,$$

(x)where $y_{mn}$ is the traverse result at position $(m, n)$ of the output feature map of the previous layer of the node; $x_{m+u,\,n+v}$ is the value of the feature map at row $m+u$ and column $n+v$; $k_{uv}$ is the value of the sliding kernel at row $u$ and column $v$; and $e$ is the derivative error.
(xi)As the traversal deepens, a connection and sharing relationship is formed, and the activation function is used to turn the linear transformation into a nonlinear one [19]. After introducing a nonlinear activation function, the deep network can approximate any function. The PReLU activation function was selected for this study:

$$f\left(x_i\right) = \begin{cases} x_i, & x_i > 0, \\ a_i x_i, & x_i \le 0, \end{cases}$$

(xii)where $x_i$ is the input of node $i$ and $a_i$ is a learnable slope. This is a piecewise function: when $x_i \le 0$, the gradient is not 0, which avoids the dead-zone problem of vanishing gradients. A sliding window of the same size and step is used to compute the sliding matrix and feature map. From the perspective of the amount of data and the number of parameters, the amount of calculation is reduced; dimensionality reduction and abstraction happen at the same time, improving the fault tolerance of the algorithm [20].
(xiii)The fully connected method is used to connect the network nodes, and the output formula of each neuron is

$$y = f\left(w^{\mathrm{T}} x + b\right),$$

(xiv)where $x$ is the input value of the node; $f$ is the activation function; $w$ is the weight vector; $b$ is the deviation; and $\mathrm{T}$ is the transpose symbol. Through this full connection, the output information features are gathered.
(xv)The aggregated information features of the human motion target gesture are extracted, and the gesture recognition result of the human motion target based on the differences in biological characteristics is obtained.
(xvi)End
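To make the classifier steps concrete, the following is a minimal PyTorch-style sketch of a classifier head combining a sliding-window convolution, the PReLU activation, and a fully connected output layer of the form y = f(Wx + b). All layer sizes and the class count are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class GestureClassifier(nn.Module):
    """Sketch of the classifier head: sliding-window convolution,
    PReLU activation, pooling, and a fully connected output layer."""

    def __init__(self, in_channels=3, num_classes=5):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 32, kernel_size=3, stride=1)
        self.act = nn.PReLU()              # nonzero gradient for x <= 0
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(32, num_classes)

    def forward(self, x):                  # x: (batch, C, H, W) feature maps
        h = self.pool(self.act(self.conv(x))).flatten(1)
        return self.fc(h)                  # class scores per gesture

logits = GestureClassifier()(torch.randn(8, 3, 56, 56))
print(logits.shape)  # torch.Size([8, 5])
```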

To sum up, the specific process of human motion target gesture recognition is shown in Figure 2.

5. Experimental Analysis and Results

In order to verify the effectiveness of the deep neural network-based algorithm for human motion target gesture recognition, simulation experiments are carried out. The experiments use MATLAB simulation software combined with the LIBSVM toolbox, apply the human motion target gesture recognition algorithm in a practical simulation, and use deep neural network technology to recognize the human motion target gesture.

5.1. Experimental Environment and Data Set

The experiment uses the MSCOCO and MPII data sets for training and testing. The MSCOCO data set is a human gesture estimation data set containing about 30,000 sample images of humans and camera-collected images, with 18 joint points. The MPII data set is a state-of-the-art articulated human gesture estimation benchmark; it contains about 25,000 sample images covering more than 40,000 people with annotated body joints, with 16 joint points.

Sixteen human joint points in the MSCOCO and MPII data sets were selected and numbered, and 10,000 sample images were selected from each of the two data sets, giving a total of 20,000 sample images for the experimental analysis. Of these, 10,000 sample images are randomly selected as training data and the remaining 10,000 as test data. One hundred groups of training runs are carried out in MATLAB, and the trained algorithm is used to test the human skeleton features. The training parameters of the gesture recognition algorithm are shown in Table 1.
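The split described above is a simple random partition of the pooled samples. A minimal sketch follows; the random seed is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=0)     # illustrative seed, not from the paper
indices = rng.permutation(20_000)       # 10k MSCOCO + 10k MPII sample images
train_idx, test_idx = indices[:10_000], indices[10_000:]
print(len(train_idx), len(test_idx))    # 10000 10000
```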

5.2. Evaluation Criteria

(i)Gesture Classification Accuracy. This refers to the correctness of the classification of the human motion target's gestures and reflects the accuracy of the classification. The gesture classification accuracy is calculated as

$$A = \frac{N_{\mathrm{c}}}{N_{\mathrm{t}}},$$

where $N_{\mathrm{c}}$ is the number of correctly classified human motion target gestures and $N_{\mathrm{t}}$ is the total number of human motion target gestures to be classified.
(ii)Gesture Recognition Time. The gesture recognition time is used as an indicator to compare the proposed algorithm with the algorithms of literature [7]–literature [11] and verify its performance.
(iii)Gesture Recognition Recall Rate. This refers to the degree of success in recognizing the relevant human motion target gestures in the data set used to measure human gesture estimation. The recall rate of gesture recognition is calculated as

$$R = \frac{N_{\mathrm{r}}}{N_{\mathrm{a}}},$$

where $N_{\mathrm{r}}$ is the number of relevant human motion target gestures recognized and $N_{\mathrm{a}}$ is the total number of human motion target gestures to be recognized.
(iv)Overlap Ratio. The overlap ratio describes the overlap between the output of the algorithm and the calibration range and serves as a key index for judging the effectiveness of gesture recognition.
(v)Gesture Recognition Precision. Taking the gesture recognition precision as an index, the advantages of the proposed algorithm are verified.
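These criteria translate into a few lines of code. The sketch below is illustrative: the accuracy and recall functions follow the formulas above, while the overlap ratio is computed in an IoU style over bounding boxes, which is an assumption, since the paper does not spell out its overlap formula.

```python
import numpy as np

def classification_accuracy(pred, truth):
    """A = N_c / N_t over gesture class labels."""
    return np.mean(np.asarray(pred) == np.asarray(truth))

def recall(num_recognized, num_total):
    """R = N_r / N_a for the relevant gestures."""
    return num_recognized / num_total

def overlap_ratio(box_a, box_b):
    """Overlap between algorithm output and calibration range (IoU form).
    Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

print(classification_accuracy([1, 2, 2], [1, 2, 3]))   # 0.666...
print(overlap_ratio((0, 0, 2, 2), (1, 1, 3, 3)))       # 0.142857...
```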

5.3. Results and Discussion

Using the examples of human motion gestures in the MSCOCO and MPII data sets, the human motion gesture classification accuracies of the proposed algorithm and of the algorithms in literature [7]–literature [11] are compared across the different data sets, as shown in Figure 3.

Based on the example of the depth map of human motion gesture in Figure 3, the accuracy of human motion gesture classification of different algorithms is calculated, and the comparison results are shown in Table 2.

According to Table 2, the proposed algorithm achieves higher human motion gesture classification accuracies of 0.82 on the MSCOCO data set and 0.86 on the MPII data set. The highest accuracy of the other literature algorithms does not exceed 0.65: literature [7] reaches 0.63, literature [8] reaches 0.65, literature [9] reaches 0.62, literature [10] reaches 0.58, and literature [11] reaches 0.50. The deep neural network used in this paper has strong representation ability and delivers the best classification effect.

In order to comprehensively evaluate the performance of the gesture recognition algorithm, the overlap ratio index is adopted, and different algorithms are compared. The higher the overlap ratio, the closer the algorithm output is to the calibration value and the better the algorithm performs. The comparison results of the overlap ratio of the different algorithms are shown in Figure 4.

The algorithm of literature [7] achieves a maximum overlap ratio of about 60% for the human motion target gesture recognition results, the algorithm of literature [8] about 58%, the algorithm of literature [9] about 50%, the algorithm of literature [10] about 58%, and the algorithm of literature [11] about 56%, while the maximum overlap ratio of the proposed algorithm is about 80%. The recognition results of the proposed algorithm clearly overlap more with the calibration range, so its recognition effect is better.

In order to verify the gesture recognition precision of the proposed deep neural network-based algorithm, the algorithms of literature [7]–literature [11] and the proposed algorithm are tested, yielding the comparison of the recognition precision of human motion target gestures shown in Figure 5.

According to Figure 5, at 500 iterations, the average recognition precision for human motion target gestures is 80% for the algorithm in literature [7], 60% for literature [8], 70% for literature [9], 78% for literature [10], and 60% for literature [11], while the average precision of the proposed algorithm is as high as 93%. The proposed algorithm therefore recognizes human motion target gestures with high precision. This is because it calculates the distance between key nodes from the static gesture of the human, uses the Kinect interface device to collect the coordinate information of the human bone joints, calculates the differences in the feature values of the human motion gestures, and extracts the node features of the motion gesture, thereby reducing the deviation from the actual feature values and improving the recognition precision.

In order to verify the gesture recognition time of the proposed algorithm, the algorithms of literature [7]–literature [11] and the proposed algorithm are compared on the recognition time of each joint point of the human motion target gesture. The comparison results are shown in Table 3.

According to Table 3, as the number of joint points of the human motion target gesture increases, the recognition time per joint point increases for all algorithms. With 16 joint points, the recognition time per joint point is 21.3 s for the algorithm in literature [7], 27.3 s for literature [8], 24.6 s for literature [9], 28.4 s for literature [10], and 29.7 s for literature [11], while the proposed algorithm needs only 17.8 s. The proposed algorithm thus recognizes each joint point of the human motion gesture more quickly. This is because it uses a deep neural network to build the overall structure of the key node network, narrows the node positioning range, and converts the overall positioning of key points into local recognition areas, improving the computational efficiency of the algorithm and shortening the recognition time per joint point. On this basis, we further verify the recall rate of human motion target gesture recognition and obtain the comparison results for the different algorithms shown in Figure 6.

According to Figure 6, when the number of input images is 10,000, the average recall rate of human motion target gesture recognition is 50% for the algorithm in literature [7], 68% for literature [8], 75% for literature [9], 59% for literature [10], and 79% for literature [11], while the average recall rate of the proposed algorithm is 88%. The proposed algorithm therefore achieves a higher recall rate of human motion target gesture recognition.

6. Conclusions

In order to improve the precision and recall of human motion target gesture recognition and shorten the recognition time, a deep neural network-based human motion target gesture recognition algorithm is proposed. According to the static gesture of the human, the distance between the key nodes is calculated; the Kinect interface device is used to collect the coordinate information of the human bone joints; the differences in the feature values of the human motion gestures are calculated; and the node features of the motion gesture are extracted to improve recognition precision. A deep neural network is used to build the overall structure of the key node network, narrow the node positioning range, and locate the smallest space-time domain of the node structure diagram through space-time sampling, improving computational efficiency and shortening recognition time. The algorithm can distinguish different human gestures and has demonstrated validity and feasibility.

However, due to the complexity of the human motion target gesture recognition process, there is still room for improvement in this research. Subsequent work can address gesture similarity, evaluating the difference between a human gesture and the standard gesture so as to measure individual differences and achieve priority matching.

Data Availability

The data used to support the findings of this study are included within the article. Readers can access the data supporting the conclusions of the study from the MSCOCO and MPII data sets.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by Ministry of Education Science and Technology Development Center Fund under grant no. 2020A050116.