Abstract

To address the low recognition rate and slow recognition speed of traditional human action recognition methods, a human action recognition method based on data deduplication technology is proposed. First, data deduplication technology and perceptual hashing are combined to build an index, and images are filtered according to the structure, color, and texture features of the human action images to remove redundant images. Then, the depth features of the processed images are extracted with depth motion maps. Finally, the features are recognized by a convolutional neural network to achieve human action recognition. The simulation results show that the proposed method achieves the best recognition results among the compared methods and has strong robustness. The results also underline the importance of human motion recognition.

1. Introduction

Human motion recognition is a major research topic in artificial intelligence and pattern recognition and, with the continuous development of science and technology, has long attracted wide attention from scholars in many fields [1, 2]. It still faces considerable challenges because of cluttered backgrounds and changes in viewpoint.

Currently, humans can recognize most movements at a glance, but some remain elusive. A human movement can be regarded as an ordered series of poses, and the same action can differ greatly between people, or for the same person under different circumstances [3]. As the number of movement types grows, the overlap between classes also increases, which makes human movement recognition more difficult.

In the following, a human motion recognition method based on data deredundancy technology is proposed. The experimental data analysis shows that the proposed method can recognize human motion quickly and accurately.

2. Methods

2.1. Image Deredundancy Processing Based on Data Deredundancy Technology

In image deredundancy, a wrong match means that a correct (non-duplicate) image may be deleted, which can cause economic losses to users [3]. To perform deredundancy on the images more reliably, the following study applies data deredundancy technology to overcome the deficiencies of existing deredundancy methods and improve retrieval accuracy.

Through perceptual hashing, images with similar spatial structures are mapped to the same bucket, which both speeds up retrieval and quickly completes the initial image filtering [4, 5]. Then, for the images in the same bucket, the block average gray feature is combined with the wavelet transform to filter the images further and thereby improve filtering accuracy. The specific operation steps are shown in Figure 1.

Perceptual hashing is a one-way mapping from a multimedia dataset to a set of perceptual digests; that is, digital representations of multimedia with the same perceptual content are mapped to the same digital digest, and the mapping satisfies perceptual robustness and security. Perceptual hashing provides secure and reliable technical support for information services such as multimedia content recognition, retrieval, and authentication. For images with similar visual perception, the hash values are identical or close to each other with high probability, whereas for images with different visual perception, the probability of obtaining similar hash values is significantly lower.

In order to realize image deredundancy processing, the following implementation is combined with data deredundancy technology. The specific operational steps are as follows.

2.1.1. Pretreatment

In order to enhance the speed of image preprocessing, it is necessary to process all images uniformly and make them into grayscale thumbnails with specifications of .

2.1.2. Extraction of Image Features

The grayscale thumbnail is divided into image blocks, and the average grayscale value of each image block and the average grayscale of image are calculated, respectively, to form the block average grayscale features of image [6]:

Among them,

2.1.3. Forming Hash Perception

Hash perception [7] is formed through the block average gray level feature , which can be expressed in formulas (1)–(5):

2.1.4. Hash Clustering

Place images with the same perceptual hash in the same bucket.

Images with the same color sequence structure are not necessarily duplicate images, yet they will most likely fall into the same bucket. Thus, hash perception and data deredundancy technology can only accelerate the indexing of the images and complete the initial image filtering, and the images must be filtered further. It also follows from the operation steps above that the block average gray feature vectors of duplicate images take approximately equal values in each dimension, so the difference between them is small.
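The feature extraction and hashing steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the thumbnail size, the function names `block_means` and `perceptual_hash`, and the thresholding rule (a bit is 1 when the block mean reaches the image mean) are assumptions chosen for the example, not the formulas of the method itself.

```python
def block_means(gray, blocks_per_side):
    """Average gray value of each block of a square grayscale image.

    gray is a list of rows; its side length must be divisible by
    blocks_per_side. Blocks are visited row by row.
    """
    step = len(gray) // blocks_per_side
    means = []
    for by in range(blocks_per_side):
        for bx in range(blocks_per_side):
            total = 0
            for y in range(by * step, (by + 1) * step):
                for x in range(bx * step, (bx + 1) * step):
                    total += gray[y][x]
            means.append(total / (step * step))
    return means


def perceptual_hash(gray, blocks_per_side):
    """Assumed rule: one bit per block, 1 if the block mean >= image mean."""
    means = block_means(gray, blocks_per_side)
    # Blocks have equal size, so the mean of block means is the image mean.
    image_mean = sum(means) / len(means)
    return "".join("1" if m >= image_mean else "0" for m in means)


# Hypothetical 4x4 thumbnail split into 2x2 blocks.
thumb = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [90, 90, 30, 30],
    [90, 90, 30, 30],
]
brighter = [[v + 5 for v in row] for row in thumb]

print(block_means(thumb, 2))
print(perceptual_hash(thumb, 2), perceptual_hash(brighter, 2))
```

Under this assumed rule, a uniform brightness shift leaves the hash unchanged, which is the kind of perceptual robustness the text describes.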
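As a rough illustration of bucket-based initial filtering, the sketch below groups images by identical hash value and keeps only the buckets that hold more than one image, since only those need the later, finer filtering stages. The file names, hash strings, and function names are hypothetical.

```python
from collections import defaultdict


def cluster_by_hash(hashes):
    """Group image ids by identical hash value.

    hashes maps an image id to its hash string.
    """
    buckets = defaultdict(list)
    for image_id, hash_value in hashes.items():
        buckets[hash_value].append(image_id)
    return dict(buckets)


def candidate_buckets(buckets):
    """Buckets with a single image cannot contain duplicates, so skip them."""
    return {h: ids for h, ids in buckets.items() if len(ids) > 1}


# Hypothetical hash values for three images.
hashes = {"a.png": "0110", "b.png": "0110", "c.png": "1001"}
buckets = cluster_by_hash(hashes)
print(candidate_buckets(buckets))
```

Images that share a bucket are only duplicate candidates; the distance-based checks are still needed to confirm them.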

The similarity between two images is then calculated through the Manhattan distance:

$$d(A, B) = \sum_{i=1}^{n} \left| a_i - b_i \right|$$

In the above formula, $A = (a_1, \ldots, a_n)$ and $B = (b_1, \ldots, b_n)$ represent the block average gray scale feature vectors of the two images being compared.
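A minimal sketch of the Manhattan-distance comparison between two block average gray feature vectors follows. The decision threshold and the helper name `is_duplicate` are assumptions for illustration; the text does not state a threshold value.

```python
def manhattan_distance(u, v):
    """Sum of absolute component-wise differences of two feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def is_duplicate(u, v, threshold):
    """Treat two images as duplicates when their distance is small enough.

    threshold is a hypothetical tuning parameter.
    """
    return manhattan_distance(u, v) <= threshold


u = [10, 200, 90, 30]
v = [12, 198, 91, 30]   # near-identical block means
w = [200, 10, 30, 90]   # very different layout

print(manhattan_distance(u, v))
print(is_duplicate(u, v, threshold=8), is_duplicate(u, w, threshold=8))
```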

The block average gray feature reuses the intermediate result produced by perceptual hash filtering, so it is very efficient to compute, but it does not capture fine local detail. The Haar wavelet is therefore used as the third layer of image filtering, since the wavelet reflects both the global information of the image and its local features in detail. The specific filtering process is as follows:
(1) Perform a fast wavelet decomposition of the collected images;
(2) Extract the 60 elements with the largest absolute value from all the elements and store them in one-dimensional form. For elements with a negative value, multiply their one-dimensional subscript by -1;
(3) Sort all the one-dimensional elements; the sorted vector is the feature vector;
(4) Calculate the similarity between different feature vectors by the Manhattan distance.

This completes the image deredundancy processing.
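The wavelet filtering steps can be sketched as follows. This is an illustrative reading rather than a definitive implementation: it assumes a plain (unnormalized) Haar decomposition applied to rows and then columns, a square image whose side is a power of two, and 1-based subscripts so that the sign of the first coefficient is not lost. The function names are hypothetical.

```python
def haar_1d(values):
    """Full 1D Haar decomposition using pairwise averages and differences."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avg + diff
        n = half
    return out


def haar_2d(image):
    """Transform every row, then every column of the row-transformed image."""
    rows = [haar_1d(r) for r in image]
    cols = [haar_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def wavelet_signature(image, keep=60):
    """Sorted signed subscripts of the `keep` largest-magnitude coefficients.

    A negative coefficient contributes its (1-based) subscript times -1.
    """
    flat = [c for row in haar_2d(image) for c in row]
    order = sorted(range(len(flat)), key=lambda i: abs(flat[i]), reverse=True)
    signed = [(i + 1) if flat[i] >= 0 else -(i + 1) for i in order[:keep]]
    return sorted(signed)


tiny = [[0, 2], [4, 2]]          # toy 2x2 image
print(haar_2d(tiny))
print(wavelet_signature(tiny, keep=3))
```

Two signatures produced this way can then be compared with the Manhattan distance, as in step (4).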

2.2. Human Movement Recognition

For images processed by deredundancy, human action recognition is carried out by combining depth information and improved convolutional neural network, which mainly includes operation steps such as feature extraction and feature recognition. The specific operation steps are given as follows.

Depth information usually gives the pixel values of a human motion image relative to a corresponding reference system, and these values are affected by various external factors. To filter out such external influences on the recognition results, the depth information in the image is first preprocessed, and the pixel values of the collected image are taken as the features. In addition, to prevent pixels with large values from dominating the features, the pixel values are normalized. The specific calculation formula is as follows:

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

In the above formula, $X'$ represents the data after normalization; $X$ represents the initial data; $X_{\max}$ represents the maximum value of the depth action sequence; and $X_{\min}$ represents the minimum value of the depth action sequence.
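The normalization can be illustrated with a short sketch, assuming the usual min-max scaling, which matches the quantities named in the text (initial data, sequence maximum, sequence minimum). The function name and the handling of a constant sequence are assumptions.

```python
def min_max_normalize(frames):
    """Scale every pixel of a depth sequence into [0, 1].

    frames is a list of 2D frames (lists of rows). The minimum and maximum
    are taken over the whole sequence, not per frame.
    """
    values = [v for frame in frames for row in frame for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant sequence has no contrast; map it to zeros (assumption).
        return [[[0.0 for _ in row] for row in frame] for frame in frames]
    span = hi - lo
    return [[[(v - lo) / span for v in row] for row in frame]
            for frame in frames]


# Two hypothetical 1x2 depth frames.
sequence = [[[0, 50]], [[100, 25]]]
print(min_max_normalize(sequence))
```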

Combined with the above different depth action images, feature extraction is carried out on the depth image. The specific operation steps are as follows:
(1) Projection: Each 3D depth image frame is projected into three orthogonal Cartesian plane coordinates while each image is adjusted to the same size.
(2) On the whole depth sequence, calculate the absolute difference of any two adjacent depth projections, and the specific calculation formula is as follows:
(3) Restructuring: In order to prevent the variation of the video sequence, the size of the whole DMM is adjusted by binary and triplet interpolation.
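Step (2) can be sketched for a single projection view as below. The sketch assumes that the frames have already been projected onto one plane and resized to a common size, and that the depth motion map (DMM) is obtained by accumulating the absolute differences over the whole sequence, as is common for DMMs; the accumulation and the function name are assumptions here.

```python
def depth_motion_map(projections):
    """Accumulate absolute differences between adjacent projected frames.

    projections is a list of equally sized 2D maps (lists of rows)
    belonging to one projection view.
    """
    if len(projections) < 2:
        raise ValueError("need at least two frames")
    height, width = len(projections[0]), len(projections[0][0])
    dmm = [[0.0] * width for _ in range(height)]
    for prev, curr in zip(projections, projections[1:]):
        for y in range(height):
            for x in range(width):
                dmm[y][x] += abs(curr[y][x] - prev[y][x])
    return dmm


# Three hypothetical 2x2 projected frames.
frames = [
    [[0, 1], [2, 3]],
    [[1, 1], [0, 3]],
    [[4, 1], [0, 0]],
]
print(depth_motion_map(frames))
```

Pixels that never change across the sequence stay at zero, so the map highlights where motion occurred.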

In deep learning, the input of a two-dimensional convolutional neural network is usually a single image, that is, a single feature. If the image carries multiple features, multiple convolutional networks must be used for feature extraction and recognition, which makes the amount of calculation too large, while using a 3D convolutional network instead significantly increases time and space complexity. To solve these problems, the convolutional neural network is improved [7, 8]. After the improvement, multiple image features can be input at the same time while the convolution kernel remains two-dimensional, so the overall complexity of the algorithm is reduced and the computational efficiency is improved [9].
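One way to read "multiple features in, two-dimensional kernels" is the standard multi-channel 2D convolution, in which each input feature map has its own 2D kernel and the per-channel responses are summed. The sketch below illustrates that reading only; it is an assumption about the improved network, and the function name and toy values are hypothetical.

```python
def conv2d_multi(channels, kernels, bias=0.0):
    """Valid 2D cross-correlation summed over input channels.

    channels: list of equally sized 2D feature maps.
    kernels: one 2D kernel per channel, all of the same size.
    """
    if len(channels) != len(kernels):
        raise ValueError("need exactly one kernel per input channel")
    height, width = len(channels[0]), len(channels[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out = []
    for y in range(height - kh + 1):
        row = []
        for x in range(width - kw + 1):
            total = bias
            for fmap, kernel in zip(channels, kernels):
                for i in range(kh):
                    for j in range(kw):
                        total += fmap[y + i][x + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


# Two hypothetical 3x3 feature maps and their 2x2 kernels.
features = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
kernels = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
print(conv2d_multi(features, kernels))
```

The kernel only slides in two dimensions, so the cost grows linearly with the number of input features rather than adding a third sliding dimension as a 3D convolution would.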

The improved convolutional neural network is mainly composed of the following parts:
(1) Two convolution layers: a convolution kernel slides over the input matrix; at each position, the kernel is multiplied element by element with the corresponding region of the input matrix, and all the products are added up to obtain the convolution result. The convolution kernel is equivalent to a feature extractor, which automatically extracts the feature information of the data through the network model [10]. The calculation formula of the convolution process is given in detail in the following formula:
(2) Two pooling layers: the pooling layer, also known as the downsampling layer, performs dimension reduction. It reduces the amount of data while preserving the effective information of the image, which enhances the antidistortion ability of the whole network. Pooling generally takes one of two forms: average pooling and maximum pooling.
(3) Two fully connected layers: the fully connected layers are usually located after the convolution and pooling layers, and a network model usually contains one or more of them. They integrate the local information from the convolution or pooling layers into an overall feature vector. The last fully connected layer is also called the output layer, and its number of nodes equals the number of categories.

The improvement of the convolutional neural network specifically includes the following aspects:
(1) Improvement of the network structure: the input is multidimensional, while the convolution remains two-dimensional. Inputting multiple features simultaneously reduces the loss of image information and the complexity of the whole recognition process.
(2) Addition of random deletion (dropout) in the convolution layers and the fully connected layers: during training, dropout places some hidden neuron nodes in an inactive state with a certain probability, which effectively deletes them from the network for that training step. This reduces the number of parameters trained in each step and helps prevent overfitting.
(3) The rectified linear unit (ReLU) is used to replace the traditional saturated nonlinear function.
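The pooling and random-deletion components listed above can be sketched in plain Python. The window size, the non-overlapping stride, and the use of inverted dropout (rescaling the surviving units) are assumptions chosen for the example, as are the function names.

```python
import random


def pool2d(fmap, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows.

    Trailing rows or columns that do not fill a window are dropped.
    """
    reducers = {"max": max, "avg": lambda w: sum(w) / len(w)}
    reduce_window = reducers[mode]
    out = []
    for y in range(0, len(fmap) - size + 1, size):
        row = []
        for x in range(0, len(fmap[0]) - size + 1, size):
            window = [fmap[y + i][x + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce_window(window))
        out.append(row)
    return out


def dropout(activations, rate, rng):
    """Inverted dropout on a flat list of activations (training time only).

    Each unit is zeroed with probability `rate`; survivors are divided by
    the keep probability so the expected activation is unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


fmap = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
print(pool2d(fmap, mode="max"))
print(pool2d(fmap, mode="avg"))
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the dropout mask reproducible between runs, which is convenient when comparing training configurations.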

The rectified linear unit (ReLU) is calculated as follows:

$$f(x) = \max(0, x)$$

On the basis of the above analysis, the improved convolutional neural network is used to realize human action recognition.
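A minimal sketch of the rectified linear unit, f(x) = max(0, x), which is assumed here to be the activation the text refers to, is shown next to a sigmoid as an example of a saturating function. The sigmoid is included only for contrast and is an assumption about what "traditional saturated nonlinear function" covers.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return max(0.0, x)


def sigmoid(x):
    """A classic saturating activation, shown for comparison only."""
    return 1.0 / (1.0 + math.exp(-x))


def relu_map(fmap):
    """Apply ReLU to every element of a 2D feature map."""
    return [[relu(v) for v in row] for row in fmap]


print(relu_map([[-2.5, 0.0], [3.0, -1.0]]))
print(sigmoid(0.0), sigmoid(10.0))
```

For large positive inputs the sigmoid flattens toward 1 while ReLU keeps growing linearly, which is why ReLU does not saturate on the positive side.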

3. Simulation Experiment

To verify the effectiveness of the proposed method, a simulation experiment is carried out. The experimental environment is Windows 10 and MATLAB R2017b. The experimental data come from the UCI dataset, which contains 507 training sample sets.

3.1. Comparison of Image Deredundancy Processing Results

In the actual calculation process, the number of image blocks affects the recall rate and clustering time of the algorithm to different degrees, so it is tested experimentally. The human motion recognition method based on spatiotemporal image segmentation and interactive region detection proposed in reference [1] is taken as Method 1, and the video human motion recognition method based on the CNN feature of the training graph proposed in reference [3] is taken as Method 2. The specific experimental comparison results are shown in Tables 1–3.

A comprehensive analysis of the experimental data in the tables above shows that, as the number of blocks increases, the recall rate and clustering time of every method change accordingly. However, the recall rate of the proposed method is significantly higher, and its clustering time significantly lower, than those of the other two methods, which validates the effectiveness and superiority of the proposed method.

3.2. Recognition Rate/(%)

To verify the accuracy of the recognition results of the proposed method, the recognition rate is selected as the evaluation index in the experiment. The higher the recognition rate, the better the recognition effect. The recognition rate comparison results of the three recognition methods are given in detail in Figures 2–4.
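For illustration, the recognition rate can be computed as the percentage of test samples whose predicted action label matches the true label. This definition is an assumption, since the text does not give a formula, and the action labels below are hypothetical.

```python
def recognition_rate(predicted, actual):
    """Percentage of samples whose predicted label equals the true label."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


predicted = ["walk", "run", "jump", "walk"]
actual = ["walk", "run", "walk", "walk"]
print(recognition_rate(predicted, actual))
```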

Figure 2 shows comparison results of the change of recognition rate of different recognition methods.

Based on the comprehensive analysis of the experimental data in Figure 2, it can be seen that the recognition rate of the proposed method is the highest among the three methods. The main reason is that the proposed method adopts the improved convolutional neural network for feature recognition on the basis of the traditional recognition method, which can effectively increase the recognition rate of the whole method.

3.3. Recognition Speed (unit/second)

The following experiment compares the recognition speed of the three methods, and the specific experimental comparison results are shown in Tables 4–6.

A comprehensive comparison of the experimental data in Table 6 shows that the proposed method has a faster recognition speed.

4. Conclusion

To address the low recognition rate of traditional recognition methods [11, 12], a human motion recognition method based on data deredundancy technology was proposed [13, 14]. Experimental results show that the proposed method can effectively eliminate the redundancy in the images, improve the recognition rate and recognition speed, and obtain a relatively satisfactory recognition effect.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare they have no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

All authors contributed equally to this study.

Acknowledgments

This work was supported by the Academic Project Fund for Top-Notch Talents in Disciplines (Majors) of Anhui Provincial Department of Education, no. gxbjZD2020098.