#### Abstract

This paper studies a deep-learning-based sports training video classification model that targets the low classification accuracy caused by the randomness of target movement in sports training video. Camera calibration is used to restore the position of the target in real three-dimensional space. After camera calibration, the sports training video is preprocessed: the input video segment is divided into equal-length segments to obtain subvideo segments. The motion vector field, brightness, color, and texture features of each subvideo segment are extracted and input into the AlexNet convolutional neural network, which uses ReLU as the activation function. Local response normalization suppresses and enhances neuron outputs to highlight useful information, so that the output classification results are more accurate. An event matching method is then applied to the output of the convolutional neural network to complete the sports training video classification. The experimental results show that the model effectively handles the randomness of target movement: the classification accuracy of sports training video exceeds 99%, and the classification speed is faster than that of the compared models.

#### 1. Introduction

With the rapid development of multimedia technology, sports have received unprecedented attention. Mainstream research on sports training video includes field and field-line detection, player detection, recognition and tracking, camera calibration, event detection, and video abstract extraction. Classification of sports training video based on semantic information refers to using machine vision technology to automatically identify the type of sports training on the field and to present the recognition result in a defined form of expression [1]. Because of the broad influence of sports, introducing machine vision and machine learning technology into sports training video classification has great potential commercial value.

At present, there has been little research on sports training video classification. Zhu et al. used a Gaussian mixture model to achieve player detection. A multitarget tracking method based on a support vector regression particle filter was used to extract the trajectories of the players and the football, and the interactive space-time information between the player and football trajectories was used to express and recognize tactical behavior in football games. Niu et al. achieved camera calibration by detecting and tracking the field lines in the video image and then expressed and recognized tactical behavior by using the space-time trajectory information of the interaction between players and the football in real space. Matej Perse et al. proposed a two-stage framework for tactical behavior recognition in basketball games. In the first stage, the players' trajectories are segmented according to a Gaussian mixture model under the generalized context information of basketball games. In the second stage, the trajectories are expressed semantically according to key information, and tactical behavior is recognized by template matching. Chen et al. designed an automatic recognition system that realized camera calibration by field line detection and recognized attack and defense patterns in basketball games from the description of player trajectories on the field. Masui et al. used background subtraction to detect players and then represented the spatial distribution of players in different areas of the field with a symbol system to recognize football tactical behavior; this is a nontracking tactical behavior recognition method. Existing tactical behavior recognition mostly uses the target trajectory as the underlying visual feature, which faces many problems. First, because of mutual occlusion between targets, the randomness of target movement, and the complexity of the background, the accuracy and persistence of target tracking remain problematic. Second, because sports training video is mainly shot from a long-distance view, players and balls are hard to identify under complex lighting conditions.

Deep learning forms more abstract high-level features by combining low-level features to discover distributed feature representations of data. The multilayer structure of a deep model lets the network learn how features are organized by itself [2] and obtain the final semantic features through multiple levels of abstraction. In 2006, Hinton et al. proposed the first feasible deep model. Since then, deep learning has become a new research field of machine learning and has been described as a revolutionary technology in artificial intelligence. Deep learning constructs a multilayer network model and combines low-level features to form high-level semantic features with abstract representation, so as to simulate the way the human brain perceives and recognizes. At present, deep learning is widely used in the recognition and detection of speech, image, and other data and has achieved remarkable results. The main contributions of this study are as follows:

(i) To study a sports training video classification model based on deep learning

(ii) To establish the sports training video classification model by using a convolutional neural network

(iii) To verify the effectiveness of the proposed approach through experiments

#### 2. Materials and Methods

##### 2.1. Camera Calibration

Camera calibration technology is used to restore the position of the target in the real three-dimensional space. On this basis, the radial distortion and tangential distortion in the nonlinear model are fully considered, the Rodrigues rotation equation is used to reduce the number of optimization parameters, and the steepest descent method and LM optimization method are used to solve the accurate parameters, respectively.

Because a real lens does not produce ideal perspective imaging, the image contains varying degrees of distortion, which can be divided into radial distortion and tangential distortion [3]. To describe the imaging model accurately, two groups of parameters are used to describe the radial and tangential distortion of the lens. The relationship between ideal coordinates and distortion parameters is as follows:

$$
\begin{aligned}
x_d &= x + \delta_x, \qquad y_d = y + \delta_y,\\
\delta_x &= x\left(k_1 r^2 + k_2 r^4 + k_3 r^6\right) + 2 p_1 x y + p_2\left(r^2 + 2x^2\right),\\
\delta_y &= y\left(k_1 r^2 + k_2 r^4 + k_3 r^6\right) + p_1\left(r^2 + 2y^2\right) + 2 p_2 x y.
\end{aligned}
\tag{1}
$$

In (1), $(x, y)$ is the normalized image coordinate calculated by the pinhole camera model; $(x_d, y_d)$ is the image coordinate actually containing distortion; $\delta_x$ and $\delta_y$ are the nonlinear distortion values; $r^2 = x^2 + y^2$; $k_1$, $k_2$, $k_3$, $p_1$, and $p_2$ are the nonlinear distortion parameters, where $k_1$, $k_2$, and $k_3$ are the radial distortion coefficients, which cause radial movement of real image points on the image plane; $p_1$ and $p_2$ are the tangential distortion coefficients.
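As an illustration of the distortion model in (1), the following minimal sketch applies radial and tangential distortion to normalized image points. It assumes the widely used Brown-Conrady form with three radial and two tangential coefficients; the function name `distort` and the array layout are illustrative choices, not part of the original method.

```python
import numpy as np

def distort(xy, k1, k2, k3, p1, p2):
    """Apply radial and tangential distortion to normalized image points.

    xy: array of shape (N, 2) with ideal (pinhole) normalized coordinates.
    Returns an array of shape (N, 2) with the distorted coordinates.
    The Brown-Conrady form used here is an assumption.
    """
    xy = np.asarray(xy, dtype=float)
    x, y = xy[:, 0], xy[:, 1]
    r2 = x**2 + y**2
    # Radial term: k1*r^2 + k2*r^4 + k3*r^6
    radial = k1 * r2 + k2 * r2**2 + k3 * r2**3
    # Nonlinear distortion values in the x and y directions
    dx = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x**2)
    dy = y * radial + p1 * (r2 + 2 * y**2) + 2 * p2 * x * y
    return np.stack([x + dx, y + dy], axis=1)
```

With all coefficients set to zero the function returns its input unchanged, which is a quick sanity check when calibrated coefficients are plugged in.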

Given the initial parameters, solving for the precise camera parameters is essentially an unconstrained multidimensional extremum problem. Because there is a deviation between the theoretical and measured pixel coordinates after the target feature points are projected onto the image plane [4–6], the optimal estimate of the camera parameters must minimize this deviation. According to nonlinear optimization theory, the objective function is expressed as follows:

$$
\min \sum_{i=1}^{n} \sum_{j=1}^{m} \left\| m_{ij} - \hat{m}\left(A, k_1, k_2, k_3, p_1, p_2, R_i, t_i, M_j\right) \right\|^2 \tag{2}
$$

In (2), $n$ is the number of target images captured by the camera under different viewing angles; $m$ is the number of target feature points; $m_{ij}$ is the observed value of the coordinate of the $j$-th feature point of the $i$-th target image; $\hat{m}(\cdot)$ is the theoretical value of the projection point coordinate of the target feature point under the nonlinear model, computed from the internal parameters $A$, the distortion coefficients, and the external parameters $R_i$, $t_i$ of the $i$-th image; $M_j$ is the spatial coordinate of the $j$-th feature point on the target.

When the target is captured from different angles, the internal parameters of the camera are regarded as constant, while the external parameters differ for each shooting angle. The number of parameters to be optimized therefore increases significantly with the number of target images [7–9]. The Rodrigues rotation equation provides a way to represent a rotation by a vector. If the 3 × 3 rotation matrix with 9 elements is represented by the 3 elements of a rotation vector $r$, the external parameters of each image are reduced to 6 (3 for rotation and 3 for translation), which greatly reduces the amount of calculation in the optimization process.

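The relationship between the rotation matrix $R$ and the rotation vector $r$ is as follows:

$$
R = \cos\theta \, I + (1 - \cos\theta)\, n n^{T} + \sin\theta \, [n]_{\times}, \qquad \theta = \|r\|, \quad n = r / \theta,
$$

where $I$ is the 3 × 3 identity matrix and $[n]_{\times}$ is the skew-symmetric cross-product matrix of the unit axis $n$.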
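The conversion from a 3-element rotation vector to a rotation matrix can be sketched as follows. This is a minimal NumPy version of the standard Rodrigues formula; the function name `rodrigues` is illustrative, and the small-angle threshold is an assumption made to avoid division by zero.

```python
import numpy as np

def rodrigues(rvec):
    """Convert a rotation vector (axis * angle) to a 3x3 rotation matrix."""
    rvec = np.asarray(rvec, dtype=float)
    theta = np.linalg.norm(rvec)          # rotation angle
    if theta < 1e-12:                     # assumed threshold: no rotation
        return np.eye(3)
    n = rvec / theta                      # unit rotation axis
    # Skew-symmetric cross-product matrix of n
    K = np.array([[0.0, -n[2], n[1]],
                  [n[2], 0.0, -n[0]],
                  [-n[1], n[0], 0.0]])
    return (np.cos(theta) * np.eye(3)
            + (1.0 - np.cos(theta)) * np.outer(n, n)
            + np.sin(theta) * K)
```

Because only three numbers per image describe the rotation, an optimizer can update `rvec` freely and rebuild a valid rotation matrix at every iteration.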

The steepest descent method searches along the negative gradient direction of the objective function until it reaches the lowest point of the objective function. For a unimodal function, it can reach the extreme point quickly. The method relies on the principle that the function value decreases along the negative gradient direction from the current point. For the initial point $x_0$ of function $f(x)$, there is a sequence $x_1$, $x_2$, ..., $x_k$ that satisfies the following relationship:

$$
x_{k+1} = x_k - \lambda_k \nabla f(x_k),
$$

where $\lambda_k > 0$ is the step length. The corresponding function values have the following relations:

$$
f(x_0) \geq f(x_1) \geq \cdots \geq f(x_k).
$$

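Because the objective function is a sum of squares to be minimized and the coordinates of the feature points on the target image are nonlinear functions of the parameters to be estimated, this is a nonlinear least squares optimization problem. The LM (Levenberg-Marquardt) method avoids the case in which $J^{T}J$ is an ill-conditioned matrix in least squares. In the LM algorithm, the descent direction $d$ is given by the following equation:

$$
\left(J^{T}J + \mu I\right) d = -J^{T} e,
$$

where $J$ is the Jacobian matrix of the residual vector $e$ with respect to the parameters, $\mu > 0$ is the damping factor, and $I$ is the identity matrix.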
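A single damped update of the LM method can be sketched as below. The sketch assumes the usual formulation in which the step solves a damped normal-equation system; the names `lm_step`, `J`, `r`, and `mu` are illustrative and do not come from the original implementation.

```python
import numpy as np

def lm_step(J, r, mu):
    """Return the LM descent direction d solving (J^T J + mu*I) d = -J^T r.

    J:  Jacobian of the residuals, shape (num_residuals, num_params)
    r:  residual vector at the current estimate, shape (num_residuals,)
    mu: damping factor; mu > 0 keeps the system well conditioned
    """
    J = np.asarray(J, dtype=float)
    r = np.asarray(r, dtype=float)
    A = J.T @ J + mu * np.eye(J.shape[1])
    return np.linalg.solve(A, -J.T @ r)
```

With a very small `mu` the step approaches the Gauss-Newton step, while a large `mu` gives a short step along the negative gradient, which is the behavior of the steepest descent method.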

Restoring the position of the target in real three-dimensional space through the above process improves the accuracy of sports training video classification.

##### 2.2. Video Preprocessing

Before sports training videos are classified, they need to be preprocessed. Video shot at a sports training site is usually divided into long-distance, medium-distance, and close-range video [10]. Long-distance shots account for a relatively large proportion of sports training video because they effectively capture information about the whole field. $V = \{f_1, f_2, \ldots, f_N\}$ is used to represent the video input, where $V$ represents the video segment corresponding to a specific sports event, $f_i$ represents the video image of frame $i$, $i = 1, 2, \ldots, N$, and $N$ indicates the number of frame images into which the input video segment is converted.

In order to classify sports training videos more accurately, the input video segment is split into subvideo segments of equal length [11]. The expression is as follows:

$$
V = \{v_1, v_2, \ldots, v_K\}, \qquad v_k = \{f_{k,1}, f_{k,2}, \ldots, f_{k,L}\}
$$

In the above equation, $k = 1, 2, \ldots, K$ and $l = 1, 2, \ldots, L$. $v_k$ represents the $k$-th subvideo segment after video segmentation, $f_{k,l}$ represents the $l$-th frame image in the $k$-th subvideo segment, $L$ is the number of frames in each subvideo segment, and $K$ represents the number of subvideo segments. After the above processing, the input and segmentation of the sports training video are completed; the time span of the segmented subvideo segments has a certain impact on the classification results.
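The equal-length segmentation can be sketched in a few lines. The function name `split_equal` is illustrative, and dropping trailing frames that do not fill a whole segment is an assumption, since the handling of a remainder is not specified here.

```python
def split_equal(frames, seg_len):
    """Split a sequence of frames into consecutive subvideo segments.

    frames:  list of frames (any objects, e.g. image arrays)
    seg_len: number of frames per subvideo segment
    Trailing frames that do not fill a whole segment are dropped (assumption).
    """
    if seg_len <= 0:
        raise ValueError("seg_len must be positive")
    n_seg = len(frames) // seg_len
    return [frames[k * seg_len:(k + 1) * seg_len] for k in range(n_seg)]
```

For example, `split_equal(list(range(10)), 3)` yields three segments of three frames each and discards the tenth frame. Varying `seg_len` is one way to study how the time span of the subvideo segments affects classification.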

##### 2.3. Feature Extraction

###### 2.3.1. Extraction of Motion Vector Field

(1) Let the size of the sports training video be , denote the resolution, and denote the length of the video sequence. The video is divided into blocks; each block size is , where and denotes the number of blocks in each block.

(2) A rectangular coordinate system is established and the motion vector is mapped to this coordinate system [12]. The mapping diagram of the motion vector field of the rectangular coordinate system is shown in Figure 1. In Figure 1, is the block with position , is the direction of the motion vector . If is the component of the motion vector of the -th block in the horizontal direction, is the component of the motion vector of the -th block in the vertical direction, and is the motion intensity of the block ; then,

(3) The coordinate system of continuous video frames is arranged in chronological order [13], and it is divided into equal angle sectors along the positive direction, is quantized to intervals, and then the histograms of and are made, respectively, so it can obtain

In (9), represents the number of motion vectors in quadrant in frame , and represents the number of quantized to in frame .

(4) The expectation and variance of the motion vector in the and directions are used to evaluate the motion in the block, namely,

In (11), and represent the components of the motion vector of the -th macroblock in the and directions in a frame, and , , , and represent the expectation and variance of the motion vector of the macroblock in the and directions, respectively.
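The four steps above can be sketched for one frame as follows. This is only an illustration under stated assumptions: the per-block motion vectors are taken as given inputs, the direction is quantized into `n_sectors` equal angular sectors, the intensity into `n_levels` equal-width intervals, and all names (`motion_features`, `n_sectors`, `n_levels`) are hypothetical.

```python
import numpy as np

def motion_features(vx, vy, n_sectors=8, n_levels=4):
    """Direction/intensity histograms and first two moments of motion vectors.

    vx, vy: horizontal and vertical motion-vector components, one per block.
    """
    vx = np.asarray(vx, dtype=float).ravel()
    vy = np.asarray(vy, dtype=float).ravel()
    intensity = np.hypot(vx, vy)                       # motion intensity per block
    angle = np.mod(np.arctan2(vy, vx), 2 * np.pi)      # direction in [0, 2*pi)
    # Quantize the direction into equal angular sectors
    sector = np.minimum((angle / (2 * np.pi) * n_sectors).astype(int),
                        n_sectors - 1)
    dir_hist = np.bincount(sector, minlength=n_sectors)
    # Quantize the intensity into equal-width intervals
    top = float(intensity.max())
    if top == 0.0:
        top = 1.0
    int_hist, _ = np.histogram(intensity, bins=n_levels, range=(0.0, top))
    # Expectation and variance in the horizontal and vertical directions
    stats = {"mean_x": vx.mean(), "var_x": vx.var(),
             "mean_y": vy.mean(), "var_y": vy.var()}
    return dir_hist, int_hist, stats
```

Stacking the histograms and statistics of consecutive frames in chronological order gives a feature sequence for a subvideo segment.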

###### 2.3.2. Extraction of Luminance Feature

Assuming that the frame resolution is , each frame is divided into blocks, and the size of each block is , where , , represents the brightness value of the -th pixel in the block, and the average brightness value of each block is , , namely,

If is used to represent the encoding value of the block luminance comparison, the encoding value of the luminance comparison result between the -th block and the -th block in the frame can be expressed by (12), where and .

Through (12), the frames can be compared according to the average brightness of blocks and encoded with “1” and “0”.
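The block-wise brightness comparison can be sketched as follows. Which pairs of blocks are compared is an assumption here: the sketch compares each block with the next one in raster order and writes 1 when the first block is at least as bright. The names `block_means` and `luminance_code` are illustrative.

```python
import numpy as np

def block_means(gray, bh, bw):
    """Average brightness of each bh x bw block of a grayscale frame."""
    gray = np.asarray(gray, dtype=float)
    rows, cols = gray.shape[0] // bh, gray.shape[1] // bw
    g = gray[:rows * bh, :cols * bw]           # crop to a whole number of blocks
    return g.reshape(rows, bh, cols, bw).mean(axis=(1, 3))

def luminance_code(gray, bh, bw):
    """Binary code: 1 if a block is at least as bright as the next block."""
    m = block_means(gray, bh, bw).ravel()      # raster order (assumption)
    return (m[:-1] >= m[1:]).astype(np.uint8)
```

The resulting bit string depends only on the ordering of block brightness, so it is insensitive to a uniform change in illumination across the frame.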

###### 2.3.3. Color Feature Extraction

Assuming that the frame size is , the frame is converted into HSV model and divided into blocks; each block size is , where , . represents the pixel value of the component of the -th pixel in the -th block of the video, where , , and ; then, the color characteristics of the sports training video are as follows:

In the above equation, , , and respectively represent the mean value, variance, and third-order moment of component in the -th block.
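A minimal sketch of the block-wise color moments is given below. It assumes the frame has already been converted to HSV and stored as an array of shape (height, width, 3), and it takes the third-order moment to be the third central moment; both points are assumptions, as is the function name `color_moments`.

```python
import numpy as np

def color_moments(hsv, bh, bw):
    """Mean, variance, and third central moment per block and channel.

    hsv: array of shape (H, W, C), assumed to be in HSV already.
    Returns an array of shape (rows, cols, C, 3).
    """
    hsv = np.asarray(hsv, dtype=float)
    H, W, C = hsv.shape
    rows, cols = H // bh, W // bw
    x = hsv[:rows * bh, :cols * bw]                    # crop to whole blocks
    blocks = x.reshape(rows, bh, cols, bw, C)
    mean = blocks.mean(axis=(1, 3))                    # (rows, cols, C)
    centered = blocks - mean[:, None, :, None, :]
    var = (centered ** 2).mean(axis=(1, 3))
    third = (centered ** 3).mean(axis=(1, 3))
    return np.stack([mean, var, third], axis=-1)
```

Flattening the returned array gives one color feature vector per frame, with three moments for each block and each of the H, S, and V components.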

###### 2.3.4. Texture Feature Extraction

Let have gray levels in sports training video. denotes a gray level cooccurrence matrix, and its element is the times of pixel pairs with gray level and gray level in . is calculated as follows:where is the gray level of the pixel , and and reflect the distance and direction between the two points.
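Counting the pixel pairs of a gray level cooccurrence matrix can be sketched as follows. The offset is given here as a horizontal and vertical displacement `(dx, dy)`, which encodes both distance and direction; this parameterization and the function name `glcm` are assumptions made for illustration.

```python
import numpy as np

def glcm(img, dx, dy, levels):
    """Gray level cooccurrence matrix for the pixel offset (dx, dy).

    img:    2-D integer array with values in [0, levels)
    Entry P[i, j] counts pairs where a pixel has gray level i and the pixel
    displaced by (dx, dy) has gray level j.
    """
    img = np.asarray(img)
    H, W = img.shape
    P = np.zeros((levels, levels), dtype=np.int64)
    for y in range(H):
        for x in range(W):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < H and 0 <= x2 < W:    # skip pairs leaving the image
                P[img[y, x], img[y2, x2]] += 1
    return P
```

Texture statistics are usually computed from the normalized matrix `P / P.sum()`; the explicit loops keep the sketch readable at the cost of speed.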

The most commonly used texture feature is used as the classification feature of sports video. The definition is as follows:

##### 2.4. Sports Training Video Classification Model Based on Convolutional Neural Network

###### 2.4.1. Neuron Layer Structure of Convolutional Neural Network

A convolutional neural network usually consists of multiple convolution layers, down sampling layers, and normalization layers. Finally, the two-dimensional feature map is connected into a vector and input to the final classifier through the fully connected layer to get the probability output.

*(1) Convolution Layer*. In a convolution layer, the feature maps of the previous layer are convolved with learnable convolution kernels, and the output feature maps are obtained through an activation function [14]. Each output may combine the convolutions of multiple inputs:

$$
x_{j}^{l} = f\left(\sum_{i \in M_{j}} x_{i}^{l-1} * k_{ij}^{l} + b_{j}^{l}\right)
$$

In the above equation, $M_j$ represents the set of input feature maps connected to the $j$-th output feature map through the convolution kernels $k_{ij}^{l}$, and $b_{j}^{l}$ is a bias term. $M_j$ determines the connection between the convolution kernels and the input layer. Each output feature map is obtained by convolving the input feature maps with a convolution kernel. Assuming that each convolution kernel extracts one pattern, each output feature map corresponds to one feature and each convolution kernel corresponds to one feature map. This is because the convolution layer uses weight sharing; that is, every neuron in a feature map convolves its input with the same convolution kernel, and each neuron is connected only to some of the input neurons, which reduces the number of parameters of the convolution layer. Function $f$ is the activation function of the neurons, which is usually a nonlinear function.

The input of a convolution layer consists of multiple two-dimensional planes, and each convolution kernel is connected to all input channels [15]. Convolution is performed in this three-dimensional space to obtain the response at each position. Sliding the convolution kernel over the whole input space yields one feature map. Usually, multiple convolution kernels are set in each convolution layer, and each kernel extracts a different feature, so that each feature map represents the feature plane extracted by the corresponding kernel.

*(2) Down Sampling Layer*. The purpose of the down sampling layer is to improve the robustness of the network to the small deformation of the input samples, so as to enhance the generalization performance of the network. is used to represent the output of a neuron in the down sampling layer. The down sampling layer can be expressed as where is the normalized weighted window, which can make down sampling of every input feature map without crossing different feature maps. The number of output feature maps in the down sampling layer is the same as the number of input feature maps, which reduces the resolution of each feature map.
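A down sampling layer of this kind can be sketched with a uniform normalized window, which is ordinary average pooling. The choice of a uniform, non-overlapping window and the function name `avg_pool` are assumptions; the weighted window in the model may differ.

```python
import numpy as np

def avg_pool(fmap, s):
    """Non-overlapping s x s average pooling applied to each feature map.

    fmap: array of shape (C, H, W); the C feature maps are pooled
          independently, so the number of maps is unchanged.
    """
    fmap = np.asarray(fmap, dtype=float)
    C, H, W = fmap.shape
    rows, cols = H // s, W // s
    x = fmap[:, :rows * s, :cols * s]          # crop to whole windows
    return x.reshape(C, rows, s, cols, s).mean(axis=(2, 4))
```

The output keeps the channel count `C` and divides the spatial resolution by `s`, matching the description above.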

*(3) Normalization Layer*. The normalization layer is very important for improving the performance of a neural network. In a convolutional neural network model, the normalization layer normalizes feature values both within the same feature map and across different feature maps, which strengthens the feature maps with higher response values and drives different convolution kernels to learn different patterns [16, 17]. Subtractive normalization at a given location subtracts from the value at that location a weighted average of the pixels in its neighborhood. The weights can be determined by a Gaussian weighted window. Divisive normalization is a common normalization algorithm that amplifies differences between response values and emphasizes features with high response values.

Local response normalization is a common normalization algorithm in convolutional networks. The response value can be expressed aswhere represents the value of the -th input feature map at the coordinate ; represents the number of input feature maps; represents the normalization on the adjacent maps.

The local response normalization layer contains three adjustable parameters, namely, the number of feature maps and parameters and . All normalization layers adopt the same parameter setting, such that , , .

*(4) Fully Connected Layer*. The fully connected layer is usually at the top of the neural network and, together with the decision layer, forms a traditional multilayer perceptron that classifies the features extracted by the convolution layers. Overfitting in a convolutional neural network is mainly caused by the large number of parameters in the fully connected layer. Dropout is usually added to the fully connected layer: in each training iteration, only a randomly selected subset of neurons participates in training, which prevents the network from overfitting.

A multilayer convolutional neural network is composed of the neuron layers described above, which perform different functions and must be combined according to certain rules to achieve good results. Among these layers, only the convolution layer and the fully connected layer contain trainable parameters, and the convolution layer retains the spatial position information of the input, which the down sampling layer requires. The convolution layer is usually alternated with the down sampling layer, so that different convolution layers extract features at different scales [18]. The fully connected layer destroys the position information of the feature planes and the distinction between individual feature planes. It is therefore usually used as part of the final multilayer perceptron classifier, which integrates the features extracted by the convolution and down sampling layers and sends them to the decision layer for classification.

###### 2.4.2. Structure of Improved Convolutional Neural Network

The AlexNet convolutional neural network of deep learning is used to classify sports training videos. The AlexNet convolutional neural network consists of 23 layers, including five convolution layers and three fully connected layers.

*(1) Use the New Activation Function ReLU*. Generally, the activation function of an artificial neuron is the hyperbolic tangent function $f(x) = \tanh(x)$ or the sigmoid function $f(x) = 1/(1 + e^{-x})$. In the experiment, it is found that when the sigmoid or hyperbolic tangent function is used and the error gradient is calculated by backpropagation, the derivative involves division, which leads to a large amount of calculation; in addition, once the number of layers of a traditional neural network increases, the vanishing gradient problem occurs. The root cause is that both functions saturate: for inputs of large magnitude, the function value changes slowly and the derivative is close to zero, which makes the hidden layers far from the output layer prone to vanishing gradients [19]. Further disadvantages of the sigmoid function are that a weight penalty factor has to be added to obtain sparsity and that its output has a nonzero mean.

The advantages of the ReLU function $f(x) = \max(0, x)$ are as follows: first, computation and convergence are faster; second, ReLU sets the output to 0 when *x* < 0, which makes the network sparse, reduces the interdependence of parameters, and alleviates overfitting; third, the function is piecewise linear, so its derivative is 1 for *x* > 0 during backpropagation, which helps avoid the vanishing gradient problem.
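The difference in gradient behavior between the sigmoid and ReLU functions can be checked numerically with the short sketch below. The function names are illustrative; the definitions are the standard ones.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # at most 0.25, near 0 for large |x|

def relu(x):
    return np.maximum(0.0, np.asarray(x, dtype=float))

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (the value at x = 0 is a convention)
    return (np.asarray(x, dtype=float) > 0).astype(float)
```

For a large positive input such as 10, `sigmoid_grad` is already smaller than 1e-4, whereas `relu_grad` stays at 1, which is why gradients pass through many ReLU layers without shrinking.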

*(2) Local Response Normalization (LRN)*. In neurobiology, there is a concept called “lateral inhibition”, which refers to the ability of excited neurons to inhibit their adjacent neurons. That is to highlight the maximum peak in the local sensing area and increase the ability of biological perception.

In the neural network, the LRN layer realizes “lateral inhibition”. Let $a^{i}_{x,y}$ be the activation value of the neuron at position $(x, y)$ of the $i$-th kernel and $b^{i}_{x,y}$ be the activation value after normalization, and let the total number of kernels be $N$; then, the mathematical model of LRN is expressed as follows:

$$
b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta}},
$$

where the sum runs over the $n$ adjacent kernel maps around the $i$-th kernel at the same spatial position, and the hyperparameters $k$, $n$, $\alpha$, and $\beta$ need to be determined on the validation set. Adding an LRN layer after the ReLU activation function is very effective: the ReLU function is unbounded when $x > 0$, so its output benefits from LRN normalization. The LRN layer is expected to detect high-frequency features and amplify them by suppressing the neighboring neurons; it also suppresses a uniform response in any given local neighborhood, that is, if all the values are large, the normalization suppresses all of them uniformly. The purpose of the LRN layer is to make useful information more prominent by inhibiting and enhancing neuron outputs.
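The LRN computation across neighboring kernel maps can be sketched as follows. The default values `n=5`, `k=2`, `alpha=1e-4`, and `beta=0.75` are those reported in the original AlexNet paper and are used here only as assumed defaults, not as the settings of this model; the function name `lrn` is illustrative.

```python
import numpy as np

def lrn(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Local response normalization across channels.

    a: activations of shape (C, H, W), one map per kernel.
    Each value is divided by (k + alpha * sum of squares over up to n
    neighboring maps at the same spatial position) ** beta.
    """
    a = np.asarray(a, dtype=float)
    C = a.shape[0]
    sq = a ** 2
    out = np.empty_like(a)
    half = n // 2
    for i in range(C):
        lo, hi = max(0, i - half), min(C - 1, i + half)
        denom = (k + alpha * sq[lo:hi + 1].sum(axis=0)) ** beta
        out[i] = a[i] / denom
    return out
```

A map whose neighbors respond strongly at the same position is divided by a larger denominator, which is the "lateral inhibition" effect described above.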

##### 2.5. Event Matching

Based on the output of the convolutional neural network, the events of the sports training test video sequence and the reference video sequence are matched by an event matching method. Given the observation symbols of a video class, a multistate traversed convolutional neural network model is trained by using features extracted from sports training video frames to obtain the event sequence (event probability and corresponding state transition) in the corresponding reference video. The reference event sequence is used to create a dictionary for a given sports training event [20]. For an event with a specific state transition in the reference event sequence, the probability distribution of the event is approximated by a Gaussian density function $N(\mu, \sigma^{2})$, where $\mu$ and $\sigma^{2}$ represent the mean value and variance of the density function, respectively. It is given by the following equation:

$$
p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right)
$$

Each state transition is assigned a mean value and variance to represent the probability of the event occurring in the category. For the sports training video clips that do not appear in the training stage, a reference convolution neural network model is used to obtain the events. Let denote the event probability of state transition at time when the test sequence in the observation symbol provides a reference model. Let denote the number of observation symbols in the test sequence. The similarity between the test video clip and the reference model is expressed by the following equation:

The similarity value between video clips and all kinds of sports training is compared, and they are classified into the category with the highest similarity value.
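The matching step can be illustrated with the sketch below. The similarity used here, the average Gaussian density of the test event probabilities under each category's reference parameters, is an assumed stand-in and not necessarily the similarity measure of the model; the data layout and the names `gaussian_pdf`, `similarity`, and `classify` are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Gaussian density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def similarity(test_events, ref_params):
    """Average density of the test events under one reference dictionary.

    test_events: list of (state_transition, event_probability) pairs
    ref_params:  dict mapping state_transition -> (mean, variance)
    Transitions missing from the dictionary contribute zero (assumption).
    """
    total = 0.0
    for transition, prob in test_events:
        if transition in ref_params:
            mean, var = ref_params[transition]
            total += gaussian_pdf(prob, mean, var)
    return total / len(test_events)

def classify(test_events, dictionaries):
    """Return the category whose reference dictionary is most similar."""
    return max(dictionaries, key=lambda c: similarity(test_events, dictionaries[c]))
```

Each category contributes one dictionary of per-transition means and variances, and a test clip is assigned to the category with the largest similarity value.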

#### 3. Results and Discussion

In order to verify the feasibility and effectiveness of the sports training video classification model, eight data sets which are often used in the classification research in the network are selected as the test objects. The data sets include eight types of sports training videos, such as basketball, volleyball, and football. The detailed contents of the videos in each data set are shown in Table 1.

Table 1 shows that the experimental data set contains many types of sports training videos. Different sizes and types of sports training videos are used to test the classification performance of different models of sports training videos. Support vector machine model and HMM model are selected as comparison models.

Three models are used to classify the sports training videos of 8 data sets, and the classification results are shown in Table 2.

The experimental results in Table 2 show that the classification of sports training videos can be realized by using the proposed model. The classification results of sports training videos by using the proposed model are similar to those of actual sports training videos, which indicates that this model has high classification performance of sports training videos.

From the classification results of the proposed model, two frames are captured at random from a video classified as basketball training, as shown in Figure 2.

As can be seen from the experimental results in Figure 2, the proposed model classifies basketball training videos accurately according to the extracted features of sports training videos, and the randomly captured frames are indeed from basketball training, which verifies that the proposed model classifies sports training videos effectively.

The classification accuracy, recall rate, and precision rate are selected as the important indexes to evaluate the classification performance of the proposed model. is used to represent the number of correct recognition results, is used to represent the number of wrong recognition results, and is used to represent the number of failed recognition results. In order to effectively reduce the error caused by a single experiment, the average value of five experiments is selected, and 2 : 1 ratio is set to randomly divide training samples and test samples. The evaluation index equation is as follows:
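The evaluation indexes can be computed from the three counts as sketched below. The definitions are assumptions based on common usage: precision as correct over correct plus wrong, recall as correct over correct plus failed, and accuracy as correct over all three counts. The function names are illustrative, and the counts are assumed to be nonzero.

```python
def evaluate(correct, wrong, failed):
    """Return (accuracy, recall, precision) from recognition counts.

    correct: number of correct recognition results
    wrong:   number of wrong recognition results
    failed:  number of failed (missed) recognition results
    The formulas below are assumed definitions.
    """
    precision = correct / (correct + wrong)
    recall = correct / (correct + failed)
    accuracy = correct / (correct + wrong + failed)
    return accuracy, recall, precision

def average_runs(runs):
    """Average (accuracy, recall, precision) tuples over repeated experiments."""
    n = len(runs)
    return tuple(sum(r[i] for r in runs) / n for i in range(3))
```

Averaging the tuples returned by `evaluate` over five random 2 : 1 splits corresponds to the protocol described above.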

Statistics of the accuracy comparison results of different data sets and different types of sports training video classification are shown in Figure 3.

As can be seen from the experimental results in Figure 3, under different data sets and different types of sports training, the classification accuracy of sports training video classified by the proposed model is higher than 99%, and the classification accuracy of sports training video classified by this model is significantly higher than that of the other two models, which effectively verifies that this model has higher classification accuracy of sports training video.

Statistics of the recall rate comparison results of different data sets and different types of sports training video classification are shown in Figure 4.

As can be seen from the experimental results in Figure 4, under different data sets and different types of sports training, the recall rate of sports training videos classified by the proposed model is higher than 98.5%, and the recall rate of sports training videos classified by this model is significantly higher than that of the other two models, which verifies that this model has higher classification accuracy of sports training videos.

Statistics of the precision rate comparison results of different data sets and different types of sports training video classification are shown in Figure 5.

As can be seen from the experimental results in Figure 5, under different data sets and different types of sports training, the precision rate of sports training video classification using the proposed model is higher than 98%, and the precision rate of sports training video classification using the proposed model is significantly higher than that of the other two models, which verifies the high accuracy of sports training video classification using the proposed model.

The analysis of the above experimental results shows that the proposed model achieves the best classification accuracy, recall rate, and precision rate across the different data sets and types of sports training videos. Basketball and football training videos have strong continuity and change frequently, so more quantized features are needed to capture the changes in these videos. Basketball and volleyball videos usually contain close-range images, while baseball and tennis videos are mostly shot from a long-distance perspective, which makes feature extraction difficult. Football is also shot from a long-distance perspective; its movement on the field is continuous and can be captured well by increasing the number of states. In football and basketball videos, a single camera is mostly used to track players or regions of interest, unlike other sports training videos, in which frequent switching between multiple cameras is conducive to event detection. By extracting video features, the model effectively mitigates the effect of the randomness of target movement and improves the classification accuracy.

The training time and test time of sports training videos classified by three models with different data sets are counted. The comparison results are shown in Table 3.

The experimental results in Table 3 show that the classification speed of sports training video using the proposed model is the fastest, and the accurate classification results of sports training video can be obtained by using shorter training time and test time of the proposed model, which verifies that this model has higher classification efficiency of sports training video.

The above experimental results show that the proposed model can accurately classify all kinds of sports training videos, which shows that this model has good classification performance. The main reason is that this model uses deep learning model to establish classification model, which can effectively improve the classification accuracy of sports training videos. For close range videos with similar categories, it still has high classification accuracy. This model has high accuracy and comprehensive performance in classifying all kinds of sports training videos.

#### 4. Conclusion

The amount of sports training video data on the Internet is growing rapidly. To manage and retrieve sports training video effectively, accurate classification of sports training video is very important. Aiming at the shortcomings of existing approaches to sports training video classification, this paper establishes a sports training video classification model based on deep learning. A convolutional neural network is used for classification. The network output is then processed by event matching, and video classification is realized according to similarity. The experimental results show that the proposed model can effectively distinguish all kinds of sports training videos and accurately detect the occurrence of events through the convolutional neural network, so as to achieve high-precision classification of sports training videos. Compared with other models, the proposed model has the advantages of simple implementation, fast processing speed, high classification accuracy, high generalization ability, and adaptability.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.