This paper studies the classification and recognition of motion video images: it extracts the features of moving target images, designs an image classification process, and establishes a neural network image classification model to perform image recognition. To handle different viewing angles of the same element, classification and recognition under the neural network are carried out with the error back-propagation algorithm. The performance of the proposed method is verified by simulation experiments. Experimental results show that the proposed method achieves a high recognition rate on moving video images, with an accuracy above 98%, and that its image classification is comprehensive. The proposed method can classify the elements in a motion video image, addressing the inability of traditional methods to identify unclear images and their low recognition accuracy.

1. Introduction

With the advent of the era of big data, pictures have become a main carrier of information transmission and one of the main information transmission tools in people's lives. With the growing demand for digital images, image recognition technology has become a mainstream research topic. Research on image recognition at home and abroad now has a history of some one hundred years, but problems remain, such as low image separation rates, environmental interference affecting recognition, slow recognition speed, and the trade-off between computation and accuracy. To solve these problems, scholars have proposed methods such as nonparametric kernel density estimation, the Stauffer and Grimson Gaussian mixture background modeling method, and multitarget recognition based on the Gaussian mixture model; however, an adaptive update rate method for improving image recognition is still lacking. Moreover, when such methods are applied to video image recognition, displacement, scale, rotation, or distortion of the recognized image introduces errors. Some scholars have therefore proposed the optical flow method, the block matching method, and the difference image method to recognize moving video images and improve recognition accuracy, but these methods struggle to recognize moving video images in real time while the target is moving. This paper therefore proposes a study of motion video image classification and recognition based on neural networks, using the distributed information storage and adaptive parallel processing characteristics of neural networks to process motion video images in real time.

The application of modern video and image processing technology in sports training is gradually developing toward intellectualization and scientization. One method uses video frame sequence analysis to collect the action characteristics of sports training; it can correctly extract these characteristics even in irregular sports and helps improve the level of training. An embedded control chip can be used to design and develop a system that realizes communication and real-time monitoring of motion video information, and, in an Internet of Things environment, integrated control of sports training video information can effectively guide training. In the analysis and design of sports training video systems, the focus is on hardware and software design: based on the hardware of the video collection and analysis system, the software of the sports training video analysis system is designed. Traditional motion video analysis methods all use embedded designs, but, as interference increases, the automatic control and scheduling of the video analysis system become distorted and control performance drops. Neural network technology can make up for exactly this flaw.

Neural networks are an important machine learning technique: a mathematical model of information processing that uses structures similar to the synaptic connections of the brain. Their most popular research direction is deep learning. Neural networks have broad and attractive prospects in system identification, pattern recognition, intelligent control, and other fields. In intelligent control especially, people are particularly interested in the self-learning function of neural networks and regard this important feature as one of the keys to solving the adaptive-capability problem of actuators in automatic control. Neural networks are built from neurons. The neuron is a biological model based on the nerve cells of the biological nervous system; the neuron mathematical model arose from mathematizing the neuron while studying the biological nervous system to explore the mechanisms of artificial intelligence. A large number of neurons of the same form are connected to form a neural network, a highly nonlinear dynamic system. Although the structure and function of each neuron are simple, the dynamic behavior of the network is very complex, so neural networks can express a wide range of phenomena in the real physical world. The neural network model is built on the mathematical model of neurons and is described by its network topology, node characteristics, and learning rules. Applying neural network technology to the classification and recognition of motion video images can remove the constraint that traditional processing methods find it difficult to recognize motion video images in real time. In this paper, the accuracy of the neural network technique is verified by experimental comparison and demonstration; the image classification process is designed, and a neural network image classification model is established to improve the accuracy of image recognition.
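The neuron mathematical model described above (a weighted sum of inputs passed through a nonlinear activation) can be sketched minimally as follows; the function name and the choice of a sigmoid activation are illustrative assumptions, not specifics of this paper:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs plus bias,
    passed through a sigmoid activation."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))

# With zero weights and zero bias the neuron outputs sigmoid(0) = 0.5.
print(neuron([1.0, 2.0], [0.0, 0.0], 0.0))  # -> 0.5
```

Connecting many such units layer by layer yields the network models discussed in Section 3.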

2. State of the Art

Motion video analysis refers to the analysis, description, and utilization of the global motion information (background motion caused by camera motion) and local motion information (foreground object motion) presented by a video. It involves image processing, computer vision, pattern recognition, and other related technologies. Because motion video analysis is widely used in video synthesis, video segmentation, video compression, video registration, and video surveillance, it has become a research direction in the field of computer vision in recent years, and a large number of articles on motion video analysis have appeared in authoritative international journals and important academic conferences. For example, Zili Niu built a target tracking model based on a Markov chain for sports training videos and realized synchronized playback of sports training videos and synchronous tracking of multiple dynamic targets through the studio [1]. To better meet the actual needs of sports training, develop a new generation of sports technique video analysis systems with more powerful functions, and maximize the application potential of digital video in competitive sports training [2], the following key technologies of motion video analysis need further study: global motion estimation, motion video panorama synthesis, video moving object extraction, video moving target tracking, video content annotation, and so on. By analyzing the performance characteristics of these technologies, the feasibility of neural-network-based motion video image recognition is discussed.

2.1. Global Motion Estimation Technique

Global motion refers to the motion of the pixels that occupy a large proportion of the video sequence. A video image usually consists of foreground and background. If the camera moves while shooting, the background motion it causes covers a large proportion of the pixels in the video sequence and forms the global motion [3]; the motion shown by the foreground objects is their motion relative to the camera and is called local motion. The purpose of global motion estimation is to recover, from the video sequence, the camera motion that causes the global motion. Motion video generally contains obvious camera motion, so obtaining accurate global motion parameters is the basis of sports video analysis.

At present, global motion estimation methods fall mainly into the differential method and the feature point correspondence method; the main difference is that the former uses the velocity field over the image pixels while the latter uses correspondences between feature points. The differential method is a feature-free method that has been widely used for global motion estimation: such algorithms define an objective function according to the velocity field over the image pixels and then use numerical optimization to solve for the optimal motion parameters. The basic principle of the feature point correspondence method is to find enough coordinates of image points belonging to the same object under different perspectives in two consecutive frames and to obtain the global motion parameters by solving the resulting overdetermined (superlinear) equation system [4].
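As a minimal illustration of the feature point correspondence idea, the sketch below assumes a translation-only camera model (the paper's full parametric models have more parameters) and uses a robust median over matched feature displacements; the function name and the median choice are assumptions of this sketch, not the paper's method:

```python
from statistics import median

def estimate_global_translation(points_a, points_b):
    """Estimate a translation-only global motion (dx, dy) from matched
    feature points in two consecutive frames. Taking the median of the
    per-point displacements makes the estimate robust to mismatches and
    to locally moving foreground points."""
    dx = median(xb - xa for (xa, _), (xb, _) in zip(points_a, points_b))
    dy = median(yb - ya for (_, ya), (_, yb) in zip(points_a, points_b))
    return dx, dy

# Three background points shifted by (3, -1) plus one foreground outlier:
pa = [(0, 0), (1, 0), (0, 1), (5, 5)]
pb = [(3, -1), (4, -1), (3, 0), (20, 20)]
print(estimate_global_translation(pa, pb))  # -> (3, -1)
```

The robustness to the outlier point previews the noise problem discussed next.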

Whether the differential method or the feature point correspondence method is used to estimate global motion parameters, noisy data such as incorrect feature matches and the local motion of foreground objects (whose motion is inconsistent with the global motion) reduce the estimation accuracy and can seriously affect further motion analysis. In addition, existing global motion estimation methods are usually computationally intensive. Therefore, as the basis of motion video analysis, both the accuracy and the speed of global motion estimation need to be improved.

2.2. Motion Video Panorama Synthesis Technology

Panorama synthesis refers to obtaining a single image showing a whole scene from a series of local images describing a continuous scene. In traditional applications, the panorama captures the static background of a scene without moving objects. In recent years, new requirements for panorama synthesis have been put forward to assist and guide sports training [5]. A motion video panorama not only must completely reconstruct the global background of the motion scene but also is required to show a series of foreground images, namely, athlete images, on that background, so as to fully present the motion trajectory and action details. Compared with the sports video itself, the panorama provides more intuitive and comprehensive information, helping coaches and athletes statically analyze the completion of movements and grasp their essentials globally. In sports video, the camera must move and zoom quickly to track the athletes' movements, and the foreground represented by the athletes usually moves obviously relative to the background. The strong continuity, intense motion, and prominent foreground of motion video make panorama synthesis very difficult. At the same time, how to effectively extract athletes' images and clearly superimpose them on the generated background panorama also needs to be studied [6].

2.3. Video Moving Object Extraction Technology

The purpose of video moving object extraction is to separate the moving object in the scene from the background, which is the basis of object-based motion analysis (such as motion overlay comparison). Existing video moving object extraction methods can be broadly divided into two categories: one segments moving objects based on temporal attributes (such as frame difference and optical flow); the other uses spatial attributes as the segmentation basis, mainly segmenting the moving object according to the region or edge information of the image. However, whichever attribute is used, irregular movement over a period of time (part of the object moving while another part stays still) exposes background regions and leaves static foreground regions; both segmentation approaches rely on motion information to separate object from background, so these exposed backgrounds and static foreground areas are easily misdetected as foreground or background, respectively, which lowers the segmentation accuracy.
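The temporal-attribute (frame difference) idea can be sketched as follows; the threshold value and the 2D-list frame representation are illustrative assumptions. Note the weakness described above: a target that stops moving produces no difference and is lost.

```python
def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Temporal segmentation by frame differencing: a pixel is marked
    as moving (1) when its intensity change between consecutive frames
    exceeds the threshold, otherwise background (0). Frames are 2D
    lists of grayscale values."""
    return [[1 if abs(c - p) > threshold else 0
             for p, c in zip(prow, crow)]
            for prow, crow in zip(prev_frame, curr_frame)]

prev = [[0, 0], [0, 0]]
curr = [[0, 100], [0, 0]]
print(frame_difference_mask(prev, curr))  # -> [[0, 1], [0, 0]]
```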

2.4. Video Moving Target Tracking Technology

Video moving target tracking technology, based on location, motion, shape, texture, color, and other features, establishes correspondences of image structure between consecutive frames, which can be used to lock onto the moving target in a video and automatically obtain its motion trajectory. Moving target tracking is an important research field in computer vision. Target tracking methods can be divided into three categories: feature-point-based tracking, region-based tracking, and contour-based tracking. (1) Feature-point-based tracking is the most commonly used technique, suitable for fast tracking of objects undergoing small translations. It usually selects one or several feature points on the target and finds the matching points in subsequent images, so as to obtain the position of the tracked target in those images; the objects it applies to are therefore somewhat limited. (2) Region-based tracking estimates the motion of the regions belonging to the tracked target, maps them to the next image frame, and then merges the mapped regions into an object according to certain criteria. Because the subsequent images must be segmented and merged during tracking, region-based tracking requires a large amount of computation. (3) Contour-based tracking tracks the contour of the object, forming a deformation curve in a closed expression as the motion information is automatically updated. The tracked contour can handle deformation of the moving target well, but when the motion range or contour deformation exceeds the range of the deformation curve, the method fails. In moving target tracking, the commonly used mathematical tools are the Kalman filter and the particle filter. The Kalman filter is a linear filter and gives good results only when the state transition equation of the moving target satisfies the linearity condition. The particle filter is based on factored sampling and has strong robustness and noise resistance [7].
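The Kalman filter mentioned above can be sketched for a one-dimensional constant-velocity target; the noise values and the unit time step are illustrative assumptions of this sketch:

```python
def kalman_1d(measurements, q=1e-3, r=0.5):
    """Minimal constant-velocity Kalman filter in one dimension.
    State is (position x, velocity v); q is the process noise and r the
    measurement noise of the observed position. Returns the filtered
    position after every measurement."""
    x, v = measurements[0], 0.0
    p = [[1.0, 0.0], [0.0, 1.0]]          # state covariance
    filtered = []
    for z in measurements:
        # predict: x' = x + v (unit time step), P' = F P F^T + Q
        x = x + v
        p = [[p[0][0] + p[0][1] + p[1][0] + p[1][1] + q, p[0][1] + p[1][1]],
             [p[1][0] + p[1][1], p[1][1] + q]]
        # update with the position measurement z
        innov = z - x
        k0 = p[0][0] / (p[0][0] + r)      # Kalman gain for position
        k1 = p[1][0] / (p[0][0] + r)      # Kalman gain for velocity
        x, v = x + k0 * innov, v + k1 * innov
        p = [[(1 - k0) * p[0][0], (1 - k0) * p[0][1]],
             [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]]]
        filtered.append(x)
    return filtered
```

For a stationary target the filtered position stays at the measured value; for a moving target the velocity component of the state lets the filter anticipate the next position, which is what makes it useful for tracking.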

2.5. Video Content Annotation Technology

The purpose of content annotation for sports video is to manage large amounts of sports video data effectively and to help users quickly find the required video content. For motion video, content annotation mainly means analyzing semantic information by extracting and processing various features of the video. Common features include visual, auditory, and textual features. Visual features include color, shape, texture, and motion. Auditory features are extracted by processing game-related sounds such as shots and the cheers of spectators and announcers. Textual features come from two sources: the words on the video screen and the closed captions in the TV signal.

According to the level of semantic analysis, existing research at home and abroad can be divided into three categories from primary to advanced, namely, scene classification, highlight extraction, and event detection. Scene classification is a preliminary division of the content of motion video; for example, a video can be divided into two scenes, game in progress and game paused, so that users can skip the pauses when watching. Highlight extraction refers to identifying the more important or interesting segments of sports games to form highlights. Event detection, in the narrow sense, refers to detecting domain-related, recurring semantic events in sports videos, such as an athlete's dive in a diving match or a shot in a football match; by annotating and organizing these events, the query needs of most users can ultimately be satisfied. In the broad sense, an event is essentially a video clip with certain semantics, so a scene or highlight can be considered a kind of event, and some people therefore call sports videos "videos of events." In this sense, we believe the core technology of motion video content analysis and annotation is actually the analysis of events and their relationships [8].

Manual annotation of video content requires a lot of work, so it is necessary to develop automatic annotation technology for semantic events of motion video [9].

2.6. Requirements of Motion Video Analysis System

The traditional sports video analysis system relies mainly on observation and analysis by the human eye, which has serious deficiencies and limitations [10]. The purpose of sports analysis is to analyze athletes' training and competition videos fully. Through the spatial and temporal correlation of video images, relevant parameters of human kinematics can be obtained, and coaches and athletes can obtain the information they want, so as to guide training scientifically. In this paper, a sports training video analysis system is designed on a suitable software development platform, which has certain practicality.

3. Methodology

The method of motion video image classification and recognition based on neural network is studied.

3.1. Extracting Image Features

In the feature extraction of a moving target, the most obvious feature is the height-to-width (aspect) ratio of the target. Therefore, when extracting the moving target, its binary image is examined, the features of the peripheral contour are extracted, and the shape of the peripheral contour is obtained, from which the aspect ratio of the moving target is calculated as a shape feature, as shown in Figure 1 [11].
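As an illustration of the aspect-ratio feature, a minimal sketch follows; the 2D binary-mask representation and the bounding-box definition of height and width are assumptions of this sketch:

```python
def aspect_ratio(mask):
    """Height-to-width ratio of the moving target's bounding box,
    computed from a binary mask (2D list of 0/1)."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    cols = [j for row in mask for j, v in enumerate(row) if v]
    height = rows[-1] - rows[0] + 1
    width = max(cols) - min(cols) + 1
    return height / width

# A target 4 pixels tall and 2 pixels wide:
mask = [[0, 1, 1], [0, 1, 1], [0, 1, 1], [0, 1, 1]]
print(aspect_ratio(mask))  # -> 2.0
```

A tall, narrow mask (such as a standing pedestrian) yields a ratio above 1, while a wide mask (such as a car) yields a ratio below 1.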

Human motion feature extraction is subject to uncertainty, complexity, and irregularity. Therefore, the dispersion shape feature of the moving target image must also be extracted: the dispersion is the ratio of the squared perimeter of the target contour to the contour area. When computing the perimeter, attention must be paid to the positions of adjacent contour points: when two adjacent points are not aligned in the same vertical or horizontal direction (diagonal neighbors), their distance is taken as √2; when they are aligned, the distance is taken as 1. The recorded moving target also deforms to some degree during movement, and the degree of deformation differs across targets; for example, the arm swing and leg swing during walking produce large deformation. The deformation degree of such targets is therefore extracted by calculating the standard deviation of the contour included angles, while for targets with small deformation only the dispersion needs to be extracted to complete the image feature extraction. The extraction process of the deformation degree shape feature is shown in Figures 2 and 3.
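The dispersion feature above can be sketched directly from its definition; the contour representation as an ordered list of (x, y) pixel coordinates and the chain-code step lengths (1 for axial moves, √2 for diagonal moves) are the conventions assumed here:

```python
import math

def dispersion(contour, area):
    """Dispersion = perimeter^2 / area of the target contour.
    The perimeter is accumulated over adjacent contour points: step
    length 1 for horizontal/vertical moves, sqrt(2) for diagonal moves.
    `contour` is an ordered, closed list of (x, y) points."""
    perimeter = 0.0
    for (x0, y0), (x1, y1) in zip(contour, contour[1:] + contour[:1]):
        perimeter += 1.0 if x0 == x1 or y0 == y1 else math.sqrt(2.0)
    return perimeter ** 2 / area

# A unit square contour: perimeter 4, area 1, dispersion 16.
print(dispersion([(0, 0), (1, 0), (1, 1), (0, 1)], 1.0))  # -> 16.0
```

Compact, rounded shapes give low dispersion (the circle minimizes it), while elongated or ragged shapes such as a walking human silhouette give high dispersion.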

In Figures 2 and 3, θl represents the included angle at the lower left corner l of the contour; θr represents the included angle at the lower right corner r; the center of mass of the moving target contour is also marked; σ represents the standard deviation; θi represents the angle value of the ith frame of the contour, i = 1, 2, ..., 10, where 10 is the maximum number of frames selected for extracting the deformation degree. Based on the above shape feature extraction methods for moving video images, the moving target features are extracted, and the targets are classified and recognized according to the feature extraction results.

3.2. Image Classification

Based on the moving target feature threshold obtained from image feature extraction, the image classification process is set, as shown in Figure 4.

In Figure 4, R represents the aspect-ratio feature; D represents the deformation degree; σ represents the dispersion feature. Image classification is completed according to the designed classification process; at this point, the neural network can be used to recognize the image, as shown in Figure 5.
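The threshold-based decision flow of Figure 4 can be sketched as a rule chain over the three features; the threshold values and class names below are illustrative placeholders, not the thresholds determined in the paper:

```python
def classify_target(r, d, sigma, r_thresh=1.5, d_thresh=0.2, s_thresh=20.0):
    """Hypothetical rule-based classifier following the flow of
    Figure 4: each shape feature (aspect ratio r, deformation degree d,
    dispersion sigma) is compared with its threshold. The thresholds
    here are toy values for illustration only."""
    if r > r_thresh and d > d_thresh:
        return "pedestrian"   # tall, strongly deforming target
    if r <= r_thresh and sigma < s_thresh:
        return "vehicle"      # wide, compact, rigid target
    return "other"

print(classify_target(2.0, 0.5, 30.0))  # -> pedestrian
print(classify_target(0.5, 0.1, 10.0))  # -> vehicle
```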

3.3. Image Recognition Based on Neural Network

To recognize images with a neural network, a training sample library must first be established for it. To improve the accuracy of the classification and recognition method, the same element is photographed and acquired from different angles, and the neural network adds sensitivity to the image samples [12].

According to the motion video images of this study, the Hebb learning rule was selected to train the established training samples. The training expression of this rule is

Δωij = λ · xi · yj,

where Δωij represents the correction of the weight ωij connecting the ith input and the jth output; xi represents the ith input sample fed into the neural network; yj represents the corresponding output sample; and λ is the parameter adjusting the learning speed.
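A Hebbian weight update increments each weight by λ times the product of the corresponding input and output; a minimal sketch (the function name and list-of-lists weight layout are ours):

```python
def hebb_update(weights, x, y, lam=0.1):
    """One Hebbian learning step: each weight w[i][j] grows in
    proportion to the product of input x[i] and output y[j], scaled
    by the learning rate lam (the lambda in the text)."""
    return [[w + lam * xi * yj for w, yj in zip(row, y)]
            for row, xi in zip(weights, x)]

# Single weight, input 1.0, output 2.0, lam 0.5: 0 + 0.5*1*2 = 1.0.
print(hebb_update([[0.0]], [1.0], [2.0], lam=0.5))  # -> [[1.0]]
```

The rule is unsupervised (it uses no error signal), which is why the paper later switches to error back-propagation to refine the weights.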

At this point, the numbers of input-layer, output-layer, and hidden-layer nodes of the neural network can be determined. Suppose the number of targets to be recognized in the motion video image is M; the number of neurons in the output layer of the neural network equals the number of targets to be recognized, so it is also M. When a training sample belongs to the jth of the M classes, the output vector of the neural network is

y = (y1, y2, ..., yM)ᵀ, with yj = 1 and yk = 0 for all k ≠ j.

In the above formula, y denotes the output vector, and the maximum number of nodes of the hidden layer is NM. According to the numbers of input, output, and hidden layers determined above, the neural network image recognition model is established, as shown in Figure 6.

In Figure 6, n represents the number of input-layer nodes; M represents the number of output-layer nodes; ω is the weight. Based on Figure 6, during image recognition an S-type (sigmoid) function is selected as the neuron threshold function, so that the input training samples are trained along the established network paths and outputs are produced. To increase the recognition accuracy of the neural network model, the error back-propagation algorithm is used to train the network [13]. Consider the output layer of the neural network.

Under a certain input, let the expected output value be Bq and the real output value be Oq, where Q denotes the output layer. Then the mean square error EQ is expressed as

EQ = (1/N) Σq=1..N (Bq − Oq)²,

where N represents the maximum number of output-layer nodes. According to the mean square error determined by this formula, the relationship between the weights and the error correction is calculated. Let the learning operator of the neural network be η; then the weight correction increment is

Δωpq = η · δq · Op.

In the above formula, δq represents the error correction; P represents the layer above the Q layer; Op represents the actual output value of the P layer. Substituting the above calculation into the neural network image recognition model completes the image recognition. In this study, the classification and recognition of moving video images proceeds by first extracting the shape features of the image and classifying the moving target, then determining the input, output, and hidden layers of the neural network, and finally performing the reverse (back-propagation) calculation to increase recognition accuracy, completing the design of the neural network [14], as shown in Figure 7 [15].
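The back-propagation training loop can be sketched end to end on a toy problem; the network sizes, learning rate, omission of bias terms, and function names below are simplifying assumptions of this sketch, not the paper's configuration:

```python
import math, random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_backprop(samples, n_in, n_hidden, epochs=2000, eta=0.5, seed=0):
    """Toy error back-propagation: one hidden layer, sigmoid units,
    mean square error. At the output, delta = (B - O) * O * (1 - O),
    and each weight is updated by eta * delta * (upstream output)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    for _ in range(epochs):
        for x, b in samples:
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
            o = sigmoid(sum(w * hi for w, hi in zip(w2, h)))
            d_out = (b - o) * o * (1 - o)            # output-layer delta
            for j in range(n_hidden):
                d_h = d_out * w2[j] * h[j] * (1 - h[j])   # hidden delta
                w2[j] += eta * d_out * h[j]
                for i in range(n_in):
                    w1[j][i] += eta * d_h * x[i]
    return w1, w2

def predict(w1, w2, x):
    """Forward pass through the trained toy network."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sigmoid(sum(w * hi for w, hi in zip(w2, h)))
```

After training on two separable samples, the network's output for the positive sample exceeds that for the negative sample, which is the behavior the recognition model relies on.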

Regression analysis attempts to extract more information from the interaction effects of genotypes and the environment. Since 1938, when Yates et al. analyzed GE interactions by regression methods, linear regression methods have been continuously developed, and Hu Bingmin et al. have discussed various linear regression models in detail. Here, only the commonly used joint regression model, also known as the F-W (Finlay-Wilkinson) model, is introduced. It further decomposes the interaction term of the ANOVA into a linear regression of genotype on the environmental effect; that is,

(ge)ij = βi · ej + δij.

Substituting it into formula (1) gives the full model, where βi is the regression coefficient for the ith genotype.

The research framework for this article is shown in Figure 8.

4. Result Analysis and Discussion

4.1. Experimental Demonstration and Analysis

This experiment demonstrates and analyzes the classification and recognition methods for motion video images. The UCF101 dataset was selected as the research object, a comparative demonstration was adopted to verify the method of this paper, and, according to the characteristics of the selected comparison methods, the Ubuntu 14.04 platform was used to build the experimental running environment on the computer. The method of this paper was denoted experimental group A, and the two motion video image classification and recognition methods mentioned in the literature, the differential image method and the video frame sequence analysis method, were denoted experimental groups B and C, respectively. With the experimental environment and objects determined, the recognition rate, classification recognition results, and image recognition accuracy of the three methods were compared, as shown in Figure 9.

4.2. Preparation for Experiment

In this experiment, the split1 partition list of the UCF101 dataset was used to divide the training and test sets according to the actual number of tests. The training sample set contained 9537 videos, and the test sample set contained 3783 videos [16]. The length of the training videos was kept around 5 min 11 s, with differences within 1 min. To ensure the rigor of the experiment, all experimental data were image-preprocessed. To prevent the network from overfitting on the motion videos, random cropping and horizontal flipping were adopted to enlarge the training samples, and corner cropping and scale jittering were adopted to adjust the focus position and scale of the images. All preprocessed video images were resized to 256 × 340 and the frame rate was adjusted to 25 fps; the input resolution was then adjusted to 224 × 224 and normalized into a 1 × 784-dimensional vector.
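The augmentation steps above (random crop, random horizontal flip, normalization into a flattened row vector) can be sketched as follows; the toy sizes and the function name are ours, standing in for the 256 × 340 / 224 × 224 pipeline described in the text:

```python
import random

def augment(image, crop_h, crop_w, rng=None):
    """Illustrative preprocessing: random crop plus random horizontal
    flip, then flattening into a single normalized row vector (values
    scaled from 0-255 into 0-1)."""
    rng = rng or random.Random(0)
    h, w = len(image), len(image[0])
    top = rng.randrange(h - crop_h + 1)
    left = rng.randrange(w - crop_w + 1)
    crop = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    if rng.random() < 0.5:                    # random horizontal flip
        crop = [row[::-1] for row in crop]
    return [v / 255.0 for row in crop for v in row]

# A 4x4 all-white toy image cropped to 2x2 flattens to four 1.0 values.
img = [[255] * 4 for _ in range(4)]
print(augment(img, 2, 2))
```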

Ubuntu 14.04 LTS was selected as the Linux operating platform, and the running environment was set up.

Three groups of motion video image classification and recognition methods were used to recognize the motion videos selected for this experiment; the experimental environment is shown in Table 1. To compare image recognition rates during recognition, and to avoid inconsistent recognition rates among the three methods between the test set and the training set, the Matlab 2015 simulation platform was adopted, and the extracted recognition rates are shown in Figure 4. As can be seen from Figure 4, although the recognition rates of experimental group C were very close at the beginning, as the number of iterations increased its recognition rate on the larger training set remained low, so it is only suitable for recognizing datasets with fewer images. Although the recognition rates of experimental group B differed markedly at the beginning, as the iterations increased the recognition rates on the two datasets approached each other, but the recognition rate on the training set remained lower than that on the test set. As the number of iterations increased, the recognition rates of experimental group A on the training set and test set converged [17]. It can be seen that the moving video image classification and recognition method of this study is not affected by the number of images, and its image recognition rate is relatively high [18].

4.3. Image Classification and Recognition Comparison

A video in the UCF101 dataset was selected as the test object of this group of experiments; part of the test video used is shown in Figure 5. The three groups of methods were used, respectively, to classify the cars and people in the images. The Matlab 2015 simulation platform was again used to extract the test results of the three methods and the classification results of the elements in the video, as shown in Figure 6. As can be seen from Figure 6, experimental group C could recognize only clear images whose elements were located at the center of the image, and it had difficulty recognizing side views and unclear images. Experimental group B could recognize images at the side of the video, not limited to elements at the center, but could not recognize unclear images. Experimental group A, however, classified all car and human elements in the images and identified the unclear elements, with no requirement on the location of the elements. It can thus be seen that the moving video image classification and recognition method of this study can recognize the elements in the image and classify them.

4.4. Comparison of Image Recognition Accuracy

The recognition accuracy is introduced into the experiment. Assuming that the number of correctly recognized samples is R and the total number of tested samples is T, the image classification recognition accuracy P is expressed as

P = (R / T) × 100%.
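The accuracy P is simply the ratio of correctly recognized samples R to total tested samples T, expressed as a percentage; in code (function name ours):

```python
def recognition_accuracy(r, t):
    """P = R / T * 100: correctly recognized samples over the total
    number of tested samples, as a percentage."""
    return r / t * 100.0

print(recognition_accuracy(49, 50))  # 49 of 50 correct -> 98.0
```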

The image recognition and image classification results of the first two groups of experiments were counted, and the accuracy of the three groups of image classification and recognition methods was compared; the results are shown in Table 2. As can be seen from Table 2, experimental group B was obviously affected by the number of detected videos: its accuracy first increased and then decreased with the number of detections, reaching a maximum of 97.69% when the number of detected videos was 138 [19] and dropping to 95.01% when the number of detected videos reached its maximum. The maximum accuracy difference in group C was 1.85, less than 2, indicating that its accuracy was little affected by the number of tests, but its classification recognition accuracy was significantly lower than that of experimental group A. Only experimental group A was both little affected by the number of detections and significantly more accurate in image classification recognition than groups B and C. It can be seen that the moving video image classification and recognition method of this study is not affected by the number of detected images and achieves high classification and recognition accuracy [20].

Based on the three experimental results above, shown in Table 2 and Figure 10, the moving video image classification and recognition method of this study achieves a high classification and recognition rate for video images and is not affected by the number of videos. It can classify and recognize unclear images and classify the elements in the images, and its image recognition accuracy is above 98% [21].

5. Conclusion

To sum up, the research on the motion video image classification and recognition method makes full use of the distributed information storage and adaptive parallel processing capabilities of neural networks and raises the recognition rate of unclear images above 98%. However, this method does not consider the complex background of moving video images and studies only the image itself. In future research, the image background should therefore be incorporated into moving video image classification; recognizing and analyzing the image background may further improve the classification and recognition of image elements.

Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.


Acknowledgments

This work is supported by the Reflection and Optimization of China's Supply in Sports Public Service from the Perspective of Optimistic and Healthy Aging (Project code: SK2021A0561).