Abstract

In order to provide effective information support for athletics training and to increase its effectiveness, an in-depth study of motion-recognition sensors and posture analysis based on artificial intelligence was conducted. First, the current state of posture analysis and recognition technology was surveyed and the related basic theory was studied; on this basis, a human posture analysis and recognition system based on artificial-intelligence motion-training sensors was developed and its key technologies were investigated in depth. Experiments show that the motion data collected by the system's inertial nodes are transmitted wirelessly to the host-computer software, which reconstructs the motion and identifies each posture and its parameters with high accuracy. During a 30-minute test, the static error was within 1° and the dynamic error within 5°, which is acceptable and tracks dynamic conditions well. The system overcomes the limitations of traditional wired or optical methods and can be widely used in sports training, human-computer interaction, monitoring, rehabilitation medicine, games, and film and television production.

1. Introduction

Human posture and action recognition is an important research direction in the field of computer vision, involving the intersection of image processing, pattern recognition, artificial intelligence, and other disciplines, and it has a wide range of applications. However, many fundamental problems remain in human pose recognition with the traditional 2D vision approach. The development of image acquisition technology, especially the appearance of the Kinect somatosensory camera capable of acquiring depth information, provides new opportunities for gesture recognition researchers. At present, depth images are mainly used for 3D scene reconstruction, and research on human posture and action recognition is still in its infancy. Although the Kinect SDK provides an application programming interface (API) for skeleton tracking, it does not provide a higher-level gesture-recognition API, mainly because human posture is ever-changing and a general recognition model cannot be built. Accurate analysis and prediction of human motion posture can provide effective data support for sports training: by obtaining data on human movement and correcting the details of athletes' movements against a standard database, athletes' performance can be improved. The development of computer technology and the rapid progress of artificial intelligence have undoubtedly invigorated today's society, driving the rapid rise of surrounding industries, and the information society is advancing rapidly toward intelligence [1]. With the emergence of new concepts and technologies such as the "smart city," the smart home, the Internet of Things, and the "cloud Internet of Things," and with the further improvement of living standards, people are paying more and more attention to their cultural and recreational life.

Posture analysis and recognition refers to a complete set of equipment and systems that uses motion measurement and capture technology to solve for the motion posture of the human body and generate a mathematical and physical model of it in three-dimensional space, so as to achieve analysis and recognition; Figure 1 shows the flow chart of the recognition process. With the gradual development of popular technologies such as augmented reality (AR), virtual reality (VR), and somatosensory games, posture capture and analysis have attracted extensive attention from scientific research institutions, scholars, and manufacturers at home and abroad [2]. As a complete system that analyzes and reconstructs human motion by capturing the key frames or key nodes of that motion, the posture capture system has gradually entered the civil consumer market. Posture capture systems use many data-measurement methods, such as electromagnetic, mechanical, optical-image, and inertial-sensor methods; the most important are image capture based on optical cameras and acquisition based on inertial sensors. If optical cameras are the eyes with which the machine perceives the outside world, the inertial sensors distributed over the joints of the human body can be regarded as its sense of touch. Compared with the optical-image method, inertial sensors are widely used commercially and in various aircraft because of advantages such as low cost, good real-time performance, freedom from installation-site constraints, and a wide capture range [3].
In the field of human-computer interaction, interfaces have developed from the original command-line input to today's natural user interfaces. The increasing intelligence of computers makes human-computer interaction more natural and harmonious. In the future, computers will be able to understand people through their language, actions, and even thoughts, and people will not need additional input devices such as mice and keyboards to issue instructions.

2. Literature Review

Ahmad et al. believe that human posture detection and motion recognition is an important topic in the field of computer vision. Its core task is to detect human targets from video sequences through image processing and analysis, machine learning, pattern recognition, and other technologies. Among the subtasks, human target detection, classification, and tracking belong to the low-level and intermediate processing stages of vision, while posture recognition, motion analysis, and behavior understanding belong to the high-level processing stage [4]. Cao et al. note that, of these four tasks in video-based detection systems, human target detection and tracking form the basis of the system and belong to low-level vision, for which many mature methods have been proposed: detection methods include background subtraction, interframe difference, and optical flow, while tracking methods include matching-based tracking and motion-characteristic-based tracking [5]. Gu et al. believe that the technologies related to human posture detection and motion analysis, which belong to the high-level processing stage, are currently a hot area of exploration and research; in recent years many research institutes have been studying how to derive higher-level semantic information about human posture and motion from video sequences, and authoritative journals and conferences also feature this field prominently [6]. Murniati et al. believe that in the field of human-computer interaction, interfaces have developed from the original command-line input to today's natural user interfaces, and the increasing intelligence of computers makes interaction between humans and computers more natural and harmonious [7]. Aslan et al. state that, in the future, computers will understand people through their language, actions, and even thoughts, without additional input devices such as mice and keyboards; in recent years, somatosensory interaction technology driven by games has become a new mode of action-based interaction [8]. Zhao et al. note that, with the continuous progress of computer science and technology, artificial intelligence has gradually matured; a deep neural network (DNN) can automatically learn data features to find sparse, distributed representations of big data, but standard convolution cannot be applied directly to captured human motion data: the convolution filter must cover the whole range of human joints so that convolution occurs only along the time direction [9]. Sharif et al. proposed a fully connected network with a bottleneck, which learns to predict future motion frames and trains a temporal encoder of human motion given the previous frame [10]. Amin et al. note that the important information of human motion lies in the moving limbs; by this principle, when representing human motion, the unchanged parts of the posture can be discarded and only the changing limb information needs to be expressed, and many methods use optical flow to represent motion information in video [11]. Kim et al. first proposed using the optical flow method to identify different human movements [12]. Martin et al. believe that model-free human posture representation describes the posture by the optical flow, silhouette, or contour in the image, which avoids solving for human-model parameters and simplifies posture estimation; however, model-free representation is affected by viewpoint, differences between human bodies, and other factors [13]. Bruder et al. mounted a detector (a built-in accelerometer) on the waist and used a support vector machine to train and recognize whether a person had fallen; experiments verified a detection accuracy of 96.7% [14].

Based on the above research on human pose analysis using wearable devices and computer vision, the two approaches have their own advantages and disadvantages, which are compared in Table 1. In the computer vision direction, if the problems of lighting and occlusion can be solved, or the human body model can be faithfully reproduced in the computer, the visual method will become a mainstream pose-analysis method.

3. Method

3.1. Action Representation Based on Attitude Sequence

To recognize an action, the action must first be represented. A static posture is represented in this paper by one frame of feature data. An action can be regarded as a combination of a series of skeleton-frame data: each frame is equivalent to a static posture, so an action can be seen as a sequence of static postures. In traditional visual action recognition, it is generally necessary to select several key-frame image sequences to represent an action, decomposing the action into several poses, because processing many image sequences is computationally expensive and cannot meet real-time requirements. This paper instead expresses an action through continuous skeleton-frame data; that is, features are extracted from the continuous skeleton data to form an action feature sequence, which describes the action more faithfully [15]. Based on these considerations, this paper represents a human action by a continuous pose sequence over a period of time. An action $A$ is then given by Formula (1):

$$A = \{P_1, P_2, \ldots, P_n\}, \tag{1}$$

where $A$ represents an action and $P_i$ represents the human posture corresponding to frame $i$. The distance features extracted in this paper are used here, so $P_i$ is the distance-feature set extracted from one frame of skeleton data. Each $P_i$ thus contains 24 eigenvalues, $n$ static postures represent an action, and the feature representing an action is $24 \times n$-dimensional [16].
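
As a minimal sketch of this representation (not the paper's exact code), the following Python snippet builds the $24 \times n$-dimensional feature of Formula (1) by stacking per-frame distance features; the choice of joint pairs here is an illustrative placeholder, since the paper does not list them.

```python
import numpy as np

# Hypothetical choice of 24 joint pairs whose Euclidean distances
# form the per-frame distance-feature vector P_i (4 x 6 = 24 pairs).
JOINT_PAIRS = [(a, b) for a in range(4) for b in range(4, 10)]

def frame_features(joints: np.ndarray) -> np.ndarray:
    """joints: (num_joints, 3) array of 3D joint positions for one frame.
    Returns the 24-dimensional distance-feature vector P_i."""
    return np.array([np.linalg.norm(joints[a] - joints[b])
                     for a, b in JOINT_PAIRS])

def action_features(frames: list) -> np.ndarray:
    """Stack per-frame features into the (n, 24) sequence representing A."""
    return np.stack([frame_features(f) for f in frames])

# Example: an action of n = 200 frames with 20 tracked joints.
frames = [np.random.rand(20, 3) for _ in range(200)]
A = action_features(frames)   # shape (200, 24), i.e., 24 x n values
```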

Temporal-encoding-based models are built to accurately identify human motion patterns. Three variants of the recognition model are used: the symmetric encoder S-TE, the time-scale encoder C-TE, and the structure encoder H-TE. The encoding-decoding framework computes the projection of the high-dimensional input data onto a low-dimensional map and predicts the output data from that projection. The high-dimensional input data is optimized by an autoencoder, as shown in Equation (2):

$$\min_{E, D} \left\| x - D(E(x)) \right\|^2, \tag{2}$$

where the encoder $E$ maps the input data $x$ to a low-dimensional space $z = E(x)$ and the decoder $D$ reconstructs the input from $z$.
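
A minimal sketch of the autoencoder objective in Equation (2) follows, assuming a plain fully connected encoder-decoder; the layer sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=24, code_dim=8):
        super().__init__()
        # Encoder E: maps x to the low-dimensional code z = E(x).
        self.encoder = nn.Sequential(nn.Linear(in_dim, 16), nn.ReLU(),
                                     nn.Linear(16, code_dim))
        # Decoder D: reconstructs x from z.
        self.decoder = nn.Sequential(nn.Linear(code_dim, 16), nn.ReLU(),
                                     nn.Linear(16, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.randn(32, 24)                      # a batch of 24-D pose features
loss = nn.functional.mse_loss(model(x), x)   # Equation (2): ||x - D(E(x))||^2
loss.backward()
```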

An alternative approach is used in this system to capture the temporal correlations of human motion data rather than a static representation of human poses. Let $x_t$ be the observed value at time $t$; the optimization function of the time encoder is shown in Equation (3):

$$\min_{E, D} \sum_t \left\| x_{t+1} - D(E(x_t)) \right\|^2. \tag{3}$$
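
A minimal, self-contained sketch of the temporal-encoder objective in Equation (3), under the assumption (consistent with [10]) that the network predicts the next observation $x_{t+1}$ from $x_t$ rather than reconstructing $x_t$; the layer sizes are again illustrative.

```python
import torch
import torch.nn as nn

# Encoder E and decoder D; sizes are illustrative placeholders.
encoder = nn.Linear(24, 8)
decoder = nn.Linear(8, 24)

seq = torch.randn(200, 24)        # one pose-feature sequence, n = 200 frames
x_t, x_next = seq[:-1], seq[1:]   # (x_t, x_{t+1}) training pairs

pred = decoder(encoder(x_t))                  # D(E(x_t)) predicts x_{t+1}
loss = nn.functional.mse_loss(pred, x_next)   # Equation (3) objective
loss.backward()
```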

3.2. Identification of Time Series

Actions have spatiotemporal attributes: in space an action is transformed into a pose sequence and a spatial model of the action is established, but time must also be considered. For the same action, the same person will perform it differently on different occasions, and different people will perform it differently [17]. An experimenter performs an action of lifting both hands at the sides of the body; three-dimensional point data are collected, 200 frames of skeleton three-dimensional point data are manually intercepted from the data stream of this action, and the skeleton data of the action are reproduced in Matlab. Figure 2 shows the change over time of the distance feature between the left-hand joint point and the spine point along one coordinate direction; Figure 3 shows the corresponding curve when the same action is completed slightly more slowly.

It can be seen from the figures that when the same person performs an action, the waveforms of the curves are similar, but there is an offset between the curves for the action completed at two different speeds. Objectively, people perform actions sometimes faster and sometimes slower; this is the natural expression of action and inevitably carries a certain randomness [18]. Therefore, it cannot be guaranteed that one person will keep a constant speed when performing an action, let alone that different people will keep the same speed and take the same time for the same action. If an action is completed quickly, its posture sequence is short; if it is completed slowly, the sequence is long. In pattern recognition, whether two actions belong to the same class is judged by the similarity of their waveforms, and the most common similarity measure is the Euclidean distance between vectors. Template matching directly computes the Euclidean distance between the features of two actions at the same time points, but the two sequences must then be of equal length [19, 20]. Suppose instead that sequences of different lengths representing the same action are warped on the time axis so that the two sequences have the same length; this makes the two waveforms of different lengths more similar. This idea is dynamic time warping.
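
The following minimal sketch illustrates the equal-length limitation of naive template matching that motivates time warping; the sequence lengths are arbitrary example values.

```python
import numpy as np

def naive_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """Frame-wise Euclidean distance; only defined for equal-length sequences."""
    assert seq_a.shape == seq_b.shape, "lengths must match for naive matching"
    return float(np.linalg.norm(seq_a - seq_b))

fast = np.random.rand(150, 24)   # action done quickly: short feature sequence
slow = np.random.rand(200, 24)   # same action done slowly: longer sequence
# naive_distance(fast, slow) would fail -- hence dynamic time warping.
```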

3.3. Dynamic Time Warping Principle

Action recognition can be regarded as recognition of time series, and a common task in time-series processing is comparing the similarity of two sequences [21]. However, two problems arise in time-series recognition: (1) the two time series whose similarity is to be compared may not have equal lengths, while general recognition algorithms require feature vectors of equal length; (2) even if the two sequences have equal lengths, their eigenvalues at the same time points may not be comparable. To solve this problem, the dynamic time warping algorithm was proposed: an optimization algorithm that warps two time series along the time axis, originally used for speech recognition. In speech recognition, the randomness of the speech signal is strong: the same utterance by the same person varies in speed, and the same utterance differs between people, so the characteristic parameters change and the recognition rate is affected [22]. The DTW algorithm uses dynamic programming to find the matching path with the smallest cumulative distance between two sequences of different lengths; this matching path is the mapping between points of the two sequences. This eliminates the difference on the time axis, reduces the distortion between the two sequences, and maximizes their overlap. For example, in Figure 4 the solid line and dotted line represent two different sequences of the same signal; their trends are roughly the same, but they are not aligned on the time axis. Figure 5 shows the aligned points of the two waveforms found after regularization by the DTW algorithm; the distance computed over the aligned points is the true similarity of the two sequences.
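
A minimal dynamic-programming sketch of the DTW distance (not the paper's exact implementation): it fills a cumulative-cost matrix and returns the minimal warped distance between two sequences of different lengths.

```python
import numpy as np

def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """seq_a: (m, d), seq_b: (n, d) feature sequences; returns the DTW cost."""
    m, n = len(seq_a), len(seq_b)
    D = np.full((m + 1, n + 1), np.inf)   # cumulative-cost matrix
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # extend the cheapest of the three allowed predecessor paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[m, n])

a = np.random.rand(150, 24)
b = np.random.rand(200, 24)
print(dtw_distance(a, b))   # works despite unequal sequence lengths
```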

Dynamic time warping thus solves the problems of unequal time-series lengths and irregular time points well. It has been widely applied in speech recognition, handwriting recognition, visual matching, and other fields [23].

4. Results and Analysis

According to the position and speed information of the sampling points and key points, gesture segmentation is carried out to determine the gesture trajectory to be recognized. Since a segmentation failure would directly affect subsequent recognition, handling for failed segmentation is added: the current gesture detection is skipped in time and the next gesture detection begins. Once a successfully segmented gesture trajectory is obtained, it enters the preprocessing and feature-extraction stage. If the timing characteristics of the track are considered, that is, the order of the track points is fixed, the timing of the key points is first aligned with the standard timing and the timing coordinates of the other sampling points are stretched accordingly; then interference points caused by detection error and gesture instability are filtered out, and the trajectory-point coordinates are normalized [24]. If timing characteristics are not considered, the timing alignment is skipped and only denoising and normalization are carried out. Preprocessing yields trajectory points from which features can be extracted and compared with templates more conveniently. A variety of features, such as the direction vector, density, and extreme points of each point, are extracted, and weights are allocated according to importance. Finally, in the matching stage, each feature value is compared with the corresponding feature value of the template, the Euclidean distance is calculated, and the matching degree with each template is obtained. If several matching degrees are similar, a secondary matching is performed; otherwise the template with the highest matching degree is taken directly as the recognition result. Users can give feedback on the recognition results: if the recognition is wrong, the track and the correct result are submitted to the back end to optimize the template until the output requirements are met.
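
A minimal sketch of the normalization and weighted-matching stages described above, under assumed details: trajectory points are min-max normalized, each feature is compared with the template's by Euclidean distance, and the weighted sum gives the matching score. The feature names, dimensions, and weights are illustrative, not the paper's.

```python
import numpy as np

def normalize(traj: np.ndarray) -> np.ndarray:
    """Min-max normalize (n, 2) trajectory points into the unit square."""
    lo, hi = traj.min(axis=0), traj.max(axis=0)
    return (traj - lo) / np.maximum(hi - lo, 1e-9)

def match_score(features: dict, template: dict, weights: dict) -> float:
    """Weighted sum of per-feature Euclidean distances (lower = better)."""
    return sum(w * np.linalg.norm(features[k] - template[k])
               for k, w in weights.items())

weights = {"direction": 0.5, "density": 0.3, "extremes": 0.2}   # assumed
gesture = {k: np.random.rand(8) for k in weights}               # extracted features
templates = {name: {k: np.random.rand(8) for k in weights}
             for name in ("circle", "swipe")}                   # stored templates

best = min(templates, key=lambda n: match_score(gesture, templates[n], weights))
print("recognized as:", best)
```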

Key-point detection algorithm. The hand pauses at key actions, so sampling points are relatively dense near key positions. The density curve is divided into several regions, and in each region the maximum-density point above a threshold is sought as a key point. In Figure 6, the abscissa represents a time interval and the ordinate represents the number of sampling points in that interval [25].
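
A minimal sketch of this key-point detection: the sampling-point density curve is split into regions, and in each region the maximum-density point above a threshold is taken as a key point. The region count and threshold are assumed example values.

```python
import numpy as np

def detect_keypoints(density: np.ndarray, num_regions=5, threshold=10):
    """density[i] = number of sampling points in time interval i.
    Returns at most one key-point index per region, where the regional
    density maximum exceeds the threshold."""
    keypoints = []
    for region in np.array_split(np.arange(len(density)), num_regions):
        peak = region[np.argmax(density[region])]   # densest point in region
        if density[peak] > threshold:
            keypoints.append(int(peak))
    return keypoints

density = np.random.randint(0, 30, size=50)   # toy density curve
print(detect_keypoints(density))
```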

Timing alignment algorithm. A gesture trajectory is a time series, and the time series of two gestures whose matching degree is to be compared may have unequal lengths, which reflects differences in gesture speed. Therefore, before comparing the matching degree, the time series of the gesture to be matched must be stretched appropriately so that the same key points correspond one to one and the sequences align well. The dynamic time warping (DTW) algorithm is used to achieve this. As shown in Figure 7, a matrix grid is constructed whose abscissa is the template time series and whose ordinate is the gesture-trajectory time series; each matrix element $(m, n)$ traversed by the broken line represents a pair of sampling points, one from the template and one from the gesture, that correspond in time, and the path is obtained by dynamic programming [26]. Using as few sampling points as possible improves the computational efficiency of the algorithm.
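
A minimal sketch of recovering the alignment path (the broken line through the matrix grid of Figure 7) by backtracking through the cumulative-cost matrix; each returned pair maps a template sample to a gesture sample.

```python
import numpy as np

def dtw_path(seq_a: np.ndarray, seq_b: np.ndarray):
    """Returns the list of aligned index pairs (template, gesture)."""
    m, n = len(seq_a), len(seq_b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack from (m, n) to (1, 1) along the cheapest predecessors
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

template = np.random.rand(30, 2)   # template trajectory points
gesture = np.random.rand(40, 2)    # captured gesture trajectory
print(dtw_path(template, gesture)[:5])
```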

From the test results, the recognition rate decreased: the average recognition rate is 88.7%, a decrease of 2.7 percentage points. The reason is that the person who recorded the standard templates strictly followed the action essentials, whereas during the test the experimenters' movements were sometimes not in place, owing to personal factors such as fatigue and to deformation of movements between individuals, which reduced the recognition rate. In general, the recognition rate can meet the requirements of common interactive applications. This experimental program is not limited to training and recognizing these gestures and can also recognize slightly more complex actions (such as standing up and sitting down), but selecting too many action-description features may reduce the recognition rate, so the recognition content and recognition range should be considered when selecting features, and an action recognition system should be designed with its application in mind. Gesture tracking and recognition is currently a hot research and development topic. This project has made good progress on a number of technologies, including denoising and feature extraction based on Bezier curves and several standard gesture data acquisition methods based on Leap Motion. The gesture recognition studied in this project is a key technology in the field of human-computer interaction and provides guidance for the development of the industry [27]. The application of gesture recognition technology lets people gradually move away from traditional input methods and provides more diversified and humanized services.

In this chapter, a Kinect-based action recognition method is studied. First, the problem of how actions are represented on the Kinect platform is solved, and the strong randomness of actions on the time axis is analyzed. Combining the idea of dynamic time warping, an action recognition algorithm based on DTW is designed, an experimental action recognition system is implemented, and six interaction-oriented actions are defined. Experiments verify that a high recognition rate can be achieved, meeting the requirements of interactive applications. Finally, the application value of the system is discussed.

5. Conclusion

An attitude analysis and recognition system based on artificial intelligence can not only measure human motion information but also obtain human motion feature data and motion state through analysis of the attitude data. Research on posture analysis and recognition systems has high scientific and theoretical significance as well as commercial, military, and social value, in areas such as stage performance, rehabilitation medicine, special-effects production, game interaction, and sports training; the market potential is enormous. In this paper, a human posture analysis and recognition system is designed based on artificial intelligence technology. Through this system, human motion information is collected, analyzed, and recognized, and the posture-solving and recognition algorithms are studied and realized. First, a feasible calibration scheme is applied to correct the inaccurate solution caused by the inconsistency between the output of a single inertial sensor and the true value. Then, according to the characteristics of the system, an attitude-solving algorithm based on gradient descent is designed to minimize the impact of system noise and estimation error, and an attitude analysis and recognition method based on attitude angles is proposed. The feasibility and performance of the system are verified by experiments.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.