#### Abstract

Traditional TLD-based target tracking in football match video tracks well once an online learning mechanism is introduced, but it loses the target when the target is severely occluded. This paper therefore proposes a deep-learning-based target tracking algorithm for football match video. A GoogLeNet-LSTM target detection algorithm performs convolution to obtain a feature map array. After processing, high-confidence candidate boxes for training and matching are obtained, and the feature maps of the detection results are pooled to obtain the depth features required for tracking. The discriminative scale space tracking (DSST) algorithm and a Markov chain Monte Carlo (MCMC) algorithm are then used for single-target and multi-target tracking, respectively. Experimental results show that the average frame rate of the algorithm stays above 35 Hz and the tracking time is about 12.5 s. The average center position deviation score is 39, the average coverage score is 40, and resource utilization is low. The algorithm tracks targets in football match video well.

#### 1. Introduction

Sports video is among the most popular video content and accounts for a large share of television programming and Internet video [1, 2]. As quality of life improves and technology advances, viewers' demands on sports video keep rising. For watching competitions, passive and flat viewing no longer meets the expectations of TV audiences [3–5], so broadcasters need to add visual effects. For game analysis, coaches need to extract data from football match video to support tactical study. For commercial applications, broadcasters need to exploit the commercial value of football broadcasts more fully. All of these require analyzing the match video data and processing the match images according to the requirement at hand [6–8]. Among sports competitions, football matches have the largest audience and attract the most attention. The detection, extraction, localization, and tracking of moving targets in football video are therefore of considerable practical significance.

The extraction and tracking of targets in football match video is an active topic in sports image and video processing, and the required techniques span many areas of image analysis and computer vision. In general, a football match video scene consists of background and targets, and the targets are the part of the video that carries the important information. Segmenting objects in the video quickly and efficiently and tracking the targets of interest is therefore the basis for subsequent image analysis [9].

Although research on target tracking has made considerable progress, robust target tracking remains challenging because of complex environments and target deformation. The core problem of target tracking is feature representation. Early features were selected by hand to suit each application scenario, but the results fell far short of practical needs. Since the advent of deep learning, the field of computer vision has developed rapidly, with deep learning techniques first applied to image classification [10–12]. In recent years, multi-target tracking algorithms based on deep learning have also made breakthroughs. Multi-target tracking is a very challenging research direction in computer vision with a wide range of practical applications, for example, intelligent video surveillance, abnormal behavior analysis, and mobile robotics. Traditional multi-target tracking algorithms tend to track poorly because their target detection is poor. Deep-learning-based detectors achieve better detection results, which in turn improves tracking accuracy. How to combine target tracking and deep learning effectively has therefore become a focus of research.

Traditional target tracking algorithms have many problems. For example, the block-based scale-adaptive CSK rigid-body target tracking algorithm proposed in reference 4 does not consider the confidence of the candidate box, so its tracking precision is low. The KCF (kernelized correlation filter)-based tracking algorithm in reference 5 applies only to single-target tracking, which limits its use. The TLD tracking algorithm proposed in reference 6 loses the target when the target is severely occluded. Reference 7 proposes a target tracking algorithm that combines SIFT (scale-invariant feature transform) features with compressed features, but its feature extraction is poor, which leads to a higher center position error, lower coverage, and a higher resource occupancy rate.

Deep learning is a new research direction in machine learning, introduced to bring machine learning closer to its original goal of artificial intelligence (AI). Deep learning learns the internal laws and representation levels of sample data, and the information obtained in the learning process greatly helps the interpretation of data such as text, images, and sound. Its ultimate goal is to give machines a human-like ability to analyze and learn, so that they can recognize text, images, sound, and other data. Deep learning is a complex class of machine learning algorithms that has achieved far better results in speech and image recognition than earlier techniques.

To address these problems, this paper proposes a deep-learning-based target tracking algorithm for football match video. In the GoogLeNet + LSTM target detection algorithm, GoogLeNet performs convolution to obtain the feature map array. After processing, high-confidence candidate boxes for training and matching are obtained, which completes target detection. The feature map of the detection result is pooled to obtain the depth features required for tracking. Based on these features, the discriminative scale space tracking algorithm and the Markov chain Monte Carlo algorithm are used for single-target and multi-target tracking, respectively.

#### 2. Materials and Methods

##### 2.1. Target Detection Based on GoogLeNet + LSTM

Target detection is the basis of multi-target tracking algorithms based on data association. The GoogLeNet + LSTM framework is used for target detection to handle problems such as small targets and occlusions in football video. GoogLeNet is first used for convolution. Its last layer produces a 1 × 1024 × 15 × 20 feature map array, which is transposed into a 300 × 1024 feature map array. Each 1024-dimensional vector corresponds to a 139 × 139 region of the original image.
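The reshaping step can be sketched as follows. This is a minimal illustration that uses a synthetic array in place of a real GoogLeNet output; the function name `to_cell_vectors` and the row-major ordering of grid cells are assumptions, not details given in the paper.

```python
import numpy as np

# Synthetic stand-in for the last-layer output: (batch, channels, rows, cols).
feature_map = np.arange(1 * 1024 * 15 * 20, dtype=np.float32).reshape(1, 1024, 15, 20)

def to_cell_vectors(fmap):
    """Turn a (1, C, H, W) feature map into an (H*W, C) array, one row per grid cell."""
    _, channels, rows, cols = fmap.shape
    # Move channels last, then flatten the spatial grid in row-major order (assumed).
    return fmap[0].transpose(1, 2, 0).reshape(rows * cols, channels)

cells = to_cell_vectors(feature_map)
print(cells.shape)
```

Row `r * 20 + c` of `cells` then holds the 1024-dimensional vector of grid cell `(r, c)`, which is the unit handed to the LSTM sub-module.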

After the 300 × 1024 feature map array is obtained by GoogLeNet convolution, each 1024-dimensional vector is processed in parallel by the LSTM sub-module. The hidden state at each output step passes through two different fully connected layers: one directly outputs the position and size of a box, and the other outputs the confidence of that box through a softmax layer. The LSTM sub-module has five such units in total; that is, each input can predict five boxes and their confidences. During training, the boxes are restricted to the central 64 × 64 region of the receptive field, and they are ordered by confidence from high to low.

After this processing, each 64 × 64 block of the original image yields five detection boxes with their confidences. The detection boxes from all sub-module outputs in a video frame are then filtered, and boxes whose confidence falls below a given threshold are removed to give the final detection result [13–17].

The filtering proceeds as follows. If a candidate box intersects a confirmed box, the candidate box is removed, and each confirmed box removes at most one candidate box. In this matching, the cost of a pair is expressed as $(m, d)$, where $m \in \{0, 1\}$ indicates whether the two boxes intersect and $d$ is the Manhattan distance between the two boxes. $m$ takes priority over $d$: two matching schemes are first compared by *m*, and only if that comparison is inconclusive are they compared by *d*. The Hungarian algorithm is used to find the least-cost matching [18–20]. Finally, with a confidence threshold of 0.5, boxes whose confidence is below 0.5 are removed.
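The filtering step can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: boxes are `(x1, y1, x2, y2)` tuples, `m` is taken to be 0 when two boxes intersect and 1 otherwise, and a brute-force search over assignments stands in for the Hungarian algorithm, so it is only practical for a handful of boxes. All function names are hypothetical.

```python
from itertools import permutations

def manhattan(a, b):
    """Manhattan (L1) distance between two boxes given as coordinate tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def intersects(a, b):
    """True if boxes (x1, y1, x2, y2) overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def pair_cost(confirmed, candidate):
    """Cost (m, d); m = 0 for intersecting boxes (assumed polarity)."""
    m = 0 if intersects(confirmed, candidate) else 1
    return (m, manhattan(confirmed, candidate))

def best_assignment(confirmed, candidates):
    """Brute-force least-cost one-to-one matching, compared by m first, then d."""
    n_conf, n_cand = len(confirmed), len(candidates)
    if n_conf <= n_cand:
        options = (list(zip(range(n_conf), p)) for p in permutations(range(n_cand), n_conf))
    else:
        options = (list(zip(p, range(n_cand))) for p in permutations(range(n_conf), n_cand))

    def total(pairs):
        costs = [pair_cost(confirmed[i], candidates[j]) for i, j in pairs]
        return (sum(c[0] for c in costs), sum(c[1] for c in costs))

    return min(options, key=total)

def stitch(confirmed, candidates, scores, threshold=0.5):
    """Remove candidates matched to an intersecting confirmed box, then filter by confidence."""
    removed = {j for i, j in best_assignment(confirmed, candidates)
               if intersects(confirmed[i], candidates[j])}
    return [c for j, c in enumerate(candidates)
            if j not in removed and scores[j] >= threshold]

confirmed = [(0, 0, 10, 10)]
candidates = [(1, 1, 11, 11), (2, 2, 12, 12), (50, 50, 60, 60), (80, 80, 90, 90)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = stitch(confirmed, candidates, scores)
```

In this example the single confirmed box removes only its closest intersecting candidate, and the last candidate is dropped by the 0.5 confidence threshold. Python compares tuples lexicographically, which matches the rule that *m* is compared before *d*.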

To train the target detection model effectively, the following issue must be addressed: the LSTM sub-module produces many candidate boxes, but some of them are detection errors [21, 22].

There are three types of errors [23, 24]:

(1) A box is predicted at a location that is not a person's head.

(2) The predicted box position deviates from the ground-truth box position.

(3) Multiple prediction boxes are generated for the same target.

Case 1 is prevented by assigning a low confidence to such candidate boxes. Case 2 is avoided by correcting the position error. Case 3 is eliminated by assigning a low confidence to the redundant prediction boxes generated for the same target.

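The loss function for model training is

$$L(G, C, f) = \alpha \sum_{i=1}^{|G|} l_{\mathrm{pos}}\left(b_{\mathrm{pos}}^{i}, \tilde{b}_{\mathrm{pos}}^{f(i)}\right) + \sum_{j=1}^{|C|} l_{c}\left(\tilde{b}_{c}^{j}, y_{j}\right),$$

where $G$ is the set of ground-truth boxes, $C$ is the set of candidate boxes, $f$ is the matching algorithm, $b^{i}$ is the $i$th ground-truth box, $\tilde{b}^{j}$ is the $j$th candidate box, $l_{\mathrm{pos}}$ is the Manhattan distance between the two, and $l_{c}$ is the cross-entropy loss, which is the softmax loss in the corresponding network [25–27]. Here $y_{j}$ indicates whether candidate $j$ is matched to a ground-truth box. The first term of the loss function represents the position error between each ground-truth box and its matched candidate box, the second term represents the confidence of the candidate boxes, and $\alpha$ adjusts the balance between the two losses.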
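A two-term loss of this kind can be sketched as below. The sketch assumes a binary cross-entropy on the "is a target" confidence and takes the matching as a ready-made dictionary; the function names and the default `alpha=1.0` are illustrative choices, not values from the paper.

```python
import math

def l1(a, b):
    """Manhattan distance between two boxes given as coordinate tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cross_entropy(p_target, label):
    """Binary cross-entropy for one candidate; label is 1 if matched, else 0."""
    p = min(max(p_target, 1e-12), 1 - 1e-12)  # clamp to avoid log(0)
    return -math.log(p) if label == 1 else -math.log(1 - p)

def detection_loss(truth, candidates, confidences, match, alpha=1.0):
    """Position term over matched pairs plus confidence term over all candidates.

    match maps a ground-truth index i to its matched candidate index f(i).
    """
    position = sum(l1(truth[i], candidates[j]) for i, j in match.items())
    matched = set(match.values())
    confidence = sum(cross_entropy(c, 1 if j in matched else 0)
                     for j, c in enumerate(confidences))
    return alpha * position + confidence

truth = [(0, 0, 10, 10)]
candidates = [(1, 0, 10, 10), (50, 50, 60, 60)]
loss = detection_loss(truth, candidates, [0.5, 0.5], {0: 0})
```

Unmatched candidates contribute only to the confidence term, which is what pushes spurious and duplicate boxes toward low confidence.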

The matching algorithm $f$ is the Hungarian algorithm, and the comparison function used is

$$\Delta\left(b^{i}, \tilde{b}^{j}\right) = \left(o_{ij}, r_{j}, d_{ij}\right),$$

where $o_{ij} \in \{0, 1\}$ is 0 if the center of the candidate box falls inside the ground-truth box and 1 otherwise. $r_{j}$ is the sequence number in which candidate box $j$ was generated. The goal is for high-confidence boxes, which are generated first, to be preferred in matching [28–30], so for the same target an earlier rank gives a lower cost [31, 32]. $d_{ij}$ is the distance between the two boxes, that is, the distance error.
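The three-part comparison can be illustrated with Python tuples, which already compare element by element from left to right. The box layout `(x1, y1, x2, y2)`, the use of the Manhattan distance for the third component, and the function name `comparison` are assumptions made for this sketch.

```python
def comparison(truth_box, cand_box, rank):
    """Return (o, r, d) for one ground-truth/candidate pair.

    o: 0 if the candidate center lies inside the ground-truth box, else 1
    r: order in which the candidate was generated (1 = first)
    d: Manhattan distance between the box coordinates (assumed metric)
    """
    cx = (cand_box[0] + cand_box[2]) / 2
    cy = (cand_box[1] + cand_box[3]) / 2
    inside = truth_box[0] <= cx <= truth_box[2] and truth_box[1] <= cy <= truth_box[3]
    d = sum(abs(a - b) for a, b in zip(truth_box, cand_box))
    return (0 if inside else 1, rank, d)

truth_box = (0, 0, 10, 10)
near_late = comparison(truth_box, (2, 2, 8, 8), rank=3)
far_early = comparison(truth_box, (20, 20, 30, 30), rank=1)
preferred = min(near_late, far_early)
```

A candidate whose center lies inside the ground-truth box wins even if it was generated later, because the first component is compared before the rank.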

The target detection results for the football match video are obtained by the GoogLeNet + LSTM detection algorithm. On this basis, deep learning is used to extract depth features.

##### 2.2. Extraction of Depth Features

The box positions obtained in the previous section are the positions of human heads in the video; each box is enlarged by a certain scale to cover the whole body [33]. Once the position and size of the target box are known, the feature map array produced by the last convolutional layer of GoogLeNet is used to extract features. Pooling the feature map over each detected target yields that target's depth features. Each feature is highly abstract and characterizes the appearance of the target in the football match video well.
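Pooling a feature map over a box can be sketched as follows. The paper does not state the pooling type or the mapping from image to grid coordinates, so this example assumes average pooling and a stride of 32 pixels (a 480 × 640 frame mapped onto the 15 × 20 grid); `pool_box_feature` is a hypothetical name.

```python
import numpy as np

STRIDE = 32  # assumed: 480 x 640 frame mapped onto a 15 x 20 grid

def pool_box_feature(fmap, box, stride=STRIDE):
    """Average-pool a (C, H, W) feature map over an image-space box (x1, y1, x2, y2)."""
    _, rows, cols = fmap.shape
    # Convert pixel coordinates to grid cells, keeping at least one cell.
    c1 = min(max(int(box[0] // stride), 0), cols - 1)
    r1 = min(max(int(box[1] // stride), 0), rows - 1)
    c2 = min(max(int(np.ceil(box[2] / stride)), c1 + 1), cols)
    r2 = min(max(int(np.ceil(box[3] / stride)), r1 + 1), rows)
    return fmap[:, r1:r2, c1:c2].mean(axis=(1, 2))

fmap = np.zeros((1024, 15, 20))
fmap[:, 1, 2] = 4.0
feature = pool_box_feature(fmap, (64, 32, 128, 64))
print(feature.shape)
```

Because the feature map is already computed during detection, this step adds only a slice and a mean per target, which is consistent with the claim that no retraining is needed.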

The distinguishing feature of the proposed algorithm is that the depth features required for tracking are obtained by pooling the existing feature map, without retraining. The accuracy of target tracking in football match video is therefore improved without affecting the real-time performance of the tracking algorithm.

##### 2.3. Single-Target Tracking Based on DSST

The discriminative scale space tracking algorithm is referred to as the DSST tracking algorithm. After the depth features are obtained, the DSST algorithm is used to track a single target in the video. DSST combines a two-dimensional position filter with a one-dimensional scale filter. The candidate position is first determined with the two-dimensional position correlation filter, and this area serves as the reference area for the one-dimensional scale filter. Candidate blocks of different scales are obtained in this way, and the scale with the highest matching degree is selected. The scale selection rule is

$$a^{n} P \times a^{n} R, \quad n \in \left\{ \left\lfloor -\frac{S-1}{2} \right\rfloor, \ldots, \left\lfloor \frac{S-1}{2} \right\rfloor \right\},$$

where $P$ and $R$ are the width and height of the target in the previous frame, $a$ is the scale factor, set to 1.02, and $S$ is the number of scales, set to 33. The scales are not linearly spaced; they are sampled from fine to coarse, moving outward from the current target size.
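With the stated settings (scale factor 1.02 and 33 scales), the candidate sizes can be generated as below. The function name and the example target size are illustrative only.

```python
def scale_candidates(width, height, a=1.02, num_scales=33):
    """Return the candidate (width, height) pairs a**n * (width, height)."""
    half = (num_scales - 1) // 2
    return [(a ** n * width, a ** n * height) for n in range(-half, half + 1)]

sizes = scale_candidates(40, 80)
print(len(sizes), sizes[16])
```

The exponents run from -16 to 16, so the middle entry is the unchanged previous size and neighboring candidates differ by a constant ratio of 1.02, which makes the sampling dense near the current size and sparse far from it.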

MOSSE correlation filters are employed to extract image features and generate the filter [34]. A series of image blocks are extracted from the target as training samples, denoted $f_{1}, f_{2}, \ldots, f_{t}$. The corresponding desired filter responses are Gaussian functions, denoted $g_{1}, g_{2}, \ldots, g_{t}$, each with its peak at the center. The goal is to find the filter $h_{t}$ that minimizes the mean square error.

The MOSSE optimal correlation filter is defined by

$$\varepsilon = \sum_{j=1}^{t} \left\| h_{t} \star f_{j} - g_{j} \right\|^{2} = \frac{1}{MN} \sum_{j=1}^{t} \left\| \bar{H}_{t} F_{j} - G_{j} \right\|^{2},$$

where $g_{j}$ is the Gaussian function, the bar denotes the complex conjugate, capital letters denote the discrete Fourier transforms of the corresponding signals, $M \times N$ is the size of the image block, and $\varepsilon$ is the error to be minimized. The second equality follows from Parseval's theorem: the left side is the spatial-domain expression and the right side is the frequency-domain expression, so the problem can be transformed from a spatial-domain solution to a frequency-domain solution. In the frequency domain, the minimizing filter is

$$H_{t} = \frac{\sum_{j=1}^{t} \bar{G}_{j} F_{j}}{\sum_{j=1}^{t} \bar{F}_{j} F_{j}}.$$

After the correlation filter is obtained, the target position in the next frame is determined from the correlation response [35]. The location with the highest response value is the new target position, and the response is

$$y = \mathcal{F}^{-1}\left\{ \bar{H}_{t} Z \right\}.$$

In this algorithm, the closed-form solution uses the extracted depth features and the Gaussian function to obtain the correlation filter $H_{t}$, where *t* is the time index. When a new frame is input, the features $Z$ extracted from the image block are correlated with the filter using the response formula, and the response score is obtained to give the candidate target [36].
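A single-channel MOSSE-style filter can be sketched with NumPy's FFT routines. This is a simplified illustration, not the authors' code: it uses random patches in place of depth features, adds a small constant `eps` to the denominator for numerical stability, and uses hypothetical function names.

```python
import numpy as np

def gaussian_response(shape, sigma=2.0):
    """Desired response: a 2-D Gaussian with its peak at the patch center."""
    rows, cols = shape
    y, x = np.mgrid[0:rows, 0:cols]
    return np.exp(-((y - rows // 2) ** 2 + (x - cols // 2) ** 2) / (2 * sigma ** 2))

def train_mosse(patches, target, eps=1e-5):
    """Closed-form filter H = sum(conj(G) * F) / sum(conj(F) * F)."""
    G = np.fft.fft2(target)
    num = np.zeros_like(G)
    den = np.zeros_like(G)
    for patch in patches:
        F = np.fft.fft2(patch)
        num += np.conj(G) * F
        den += np.conj(F) * F
    return num / (den + eps)  # eps is an added stabilizer (assumption)

def locate(H, patch):
    """Return the (row, col) of the peak of the response F^-1{conj(H) * Z}."""
    Z = np.fft.fft2(patch)
    response = np.real(np.fft.ifft2(np.conj(H) * Z))
    return tuple(int(i) for i in np.unravel_index(np.argmax(response), response.shape))

rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))
H = train_mosse([patch], gaussian_response((32, 32)))
peak_same = locate(H, patch)
peak_shifted = locate(H, np.roll(patch, (3, 5), axis=(0, 1)))
```

On the training patch the response peaks at the patch center, and a circular shift of the patch moves the peak by the same offset, which is how the filter reports the target displacement between frames.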

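DSST represents the input image as a $d$-dimensional feature vector. The input signal $f$ represents an image block of the input image, and $f^{l}$ is its $l$th feature dimension. The optimal correlation filter is established following the MOSSE idea by minimizing

$$\varepsilon = \left\| \sum_{l=1}^{d} h^{l} \star f^{l} - g \right\|^{2} + \lambda \sum_{l=1}^{d} \left\| h^{l} \right\|^{2},$$

where $l$ indexes one dimension of the feature and $\lambda$ is the regularization coefficient. The minimizing solution is

$$H^{l} = \frac{\bar{G} F^{l}}{\sum_{k=1}^{d} \bar{F}^{k} F^{k} + \lambda}.$$

Solving a $d \times d$ linear system directly for every pixel in the image block requires too much computation and time. A robust approximate solution is therefore obtained by updating the numerator $A_{t}^{l}$ and the denominator $B_{t}$ of the above equation separately:

$$A_{t}^{l} = (1 - \eta) A_{t-1}^{l} + \eta \bar{G}_{t} F_{t}^{l}, \qquad B_{t} = (1 - \eta) B_{t-1} + \eta \sum_{k=1}^{d} \bar{F}_{t}^{k} F_{t}^{k},$$

where $\eta$ is the learning rate. The position of the target in the new frame is the location of the maximum response of the correlation filter, obtained by

$$y = \mathcal{F}^{-1}\left\{ \frac{\sum_{l=1}^{d} \bar{A}^{l} Z^{l}}{B + \lambda} \right\}.$$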
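The running update of the numerator and denominator can be sketched as follows. The learning rate `eta=0.025` and regularization `lam=0.01` are placeholder values assumed for the example, the feature channels are random arrays standing in for depth features, and the function names are hypothetical.

```python
import numpy as np

def update_filter(A_prev, B_prev, feats, target, eta=0.025):
    """Update numerator A (per channel) and denominator B with learning rate eta.

    feats: (d, M, N) feature channels; target: (M, N) desired Gaussian response.
    """
    G = np.fft.fft2(target)
    F = np.fft.fft2(feats, axes=(-2, -1))
    A_new = np.conj(G)[None] * F
    B_new = np.sum(np.conj(F) * F, axis=0).real
    if A_prev is None:  # first frame: no history to blend with
        return A_new, B_new
    return (1 - eta) * A_prev + eta * A_new, (1 - eta) * B_prev + eta * B_new

def detect(A, B, feats, lam=0.01):
    """Return the peak location of F^-1{ sum_l conj(A^l) Z^l / (B + lam) }."""
    Z = np.fft.fft2(feats, axes=(-2, -1))
    response = np.real(np.fft.ifft2(np.sum(np.conj(A) * Z, axis=0) / (B + lam)))
    return tuple(int(i) for i in np.unravel_index(np.argmax(response), response.shape))

rng = np.random.default_rng(1)
feats = rng.standard_normal((4, 32, 32))
y, x = np.mgrid[0:32, 0:32]
target = np.exp(-((y - 16) ** 2 + (x - 16) ** 2) / 8.0)

A, B = update_filter(None, None, feats, target)
peak = detect(A, B, feats)
moved = np.roll(feats, (2, -3), axis=(1, 2))
peak_moved = detect(A, B, moved)
A, B = update_filter(A, B, moved, target)
```

Only element-wise products and one inverse FFT are needed per frame, which avoids solving a linear system at every pixel.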

The DSST algorithm uses the two correlation filters to track a single target in football match video. The algorithm is portable and efficient, but problems remain. When multiple targets are tracked, occlusion between targets inevitably reduces tracking accuracy, and the algorithm discriminates poorly between similar targets. Therefore, the Markov chain Monte Carlo (HDDMCMC) algorithm is adopted for tracking multiple targets in football match video.

##### 2.4. Multi-Target Tracking Based on HDDMCMC Algorithm

###### 2.4.1. Intra-Segment MCMC Algorithm

In multi-target tracking, motion is stable and continuous, so the appearance of the same target does not change drastically between consecutive video frames. The intra-segment MCMC algorithm therefore uses the depth features of Section 2.2 to measure the similarity of target trajectories.

Each detection target is treated as a node and is described by the intra-segment time . Suppose the set of nodes of the video frame in *t* is , and the posterior probability is as where and represent the nth and *n* + 1th nodes in the kth pedestrian trajectory, respectively. is the similarity of two nodes, and it can be calculated by the cosine of the angle of the two-node depth feature. is the length of different tracks, and represents the number of false alarms to ensure that the false alarm rate is low.
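The node similarity described above, the cosine of the angle between two depth feature vectors, can be computed as in this short sketch. The helper `most_similar` and the toy three-dimensional features are illustrative; real depth features would be much longer vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def most_similar(track_feature, detection_features):
    """Index of the detection whose feature is closest in angle to the track's last node."""
    return max(range(len(detection_features)),
               key=lambda i: cosine_similarity(track_feature, detection_features[i]))

best = most_similar([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
```

Because the cosine ignores vector length, two detections of the same player remain similar even if the overall feature magnitude changes between frames.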

###### 2.4.2. Inter-Segment MCMC Algorithm

The data used are mainly the target trajectory generated by the MCMC algorithm in the segment. The main actions taken by the algorithm include fusion, splitting, and switching operations. After passing the intra-segment MCMC algorithm, many more reliable target trajectories are generated. At this time, if there is a case where the same target trajectory is broken, it is caused by unstable detection data. Therefore, the purpose of the inter-segment MCMC is to further combine the target trajectory data of the two time periods. In the current state, the posterior probability is updated as follows:where false alarm factors are no longer considered because they are mainly used to divide the target trajectory [37]. For fusion operations, the allowed time interval is set as and the frame difference at the junction between the track segments of the two targets cannot exceed 6. The standard deviation of the probability is set to . In this way, the unit that can be transferred between different states is a relatively complete target trajectory segment that has been generated previously [38–40].

The inter-segment MCMC algorithm moves the target that has gone out of the video scene out of the current data set. The current data set is assumed to be . After the MCMC gets the trajectory in the next segment, it is matched by the inter-segment MCMC algorithm. That is to continue to build on the previous target data, combined with the current target data, to further data integration to optimize. The entire algorithm is continuously performed in such a sliding manner.
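The frame-gap condition for the fusion move can be illustrated as below. This sketch covers only the temporal check (a gap of at most 6 frames between the end of one track segment and the start of the next); the posterior evaluation and the appearance similarity are omitted, and the dictionary layout and function names are assumptions.

```python
def can_fuse(track_a, track_b, max_gap=6):
    """True if track_b starts after track_a ends and within max_gap frames."""
    gap = track_b["start"] - track_a["end"]
    return 0 < gap <= max_gap

def fuse(track_a, track_b):
    """Join two track segments into one longer trajectory."""
    return {
        "start": track_a["start"],
        "end": track_b["end"],
        "nodes": track_a["nodes"] + track_b["nodes"],
    }

first = {"start": 0, "end": 40, "nodes": ["a0", "a1"]}
second = {"start": 44, "end": 90, "nodes": ["b0"]}
late = {"start": 60, "end": 90, "nodes": ["c0"]}

merged = fuse(first, second) if can_fuse(first, second) else None
```

In a full implementation, a pair that passes this check would still be accepted or rejected according to the updated posterior probability.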

#### 3. Results

In order to verify the superiority of each aspect of the proposed algorithm, it is compared with some traditional algorithms such as CSK, KCF, Struck, CT, and TLD. The experimental object is the video of football match. The threshold set by the accuracy is 20 pixels, and the threshold of the success rate is set as . The results are shown in Tables 1 and 2.

The comparison shows that the proposed algorithm reaches an accuracy of up to 0.88 and a success rate of up to 0.81. Although it is not the best on every video sequence, its overall performance is robust, and it still tracks the target accurately when the target is partially occluded.

The center position error is the distance between the center of the tracking box and the center of the ground-truth box, and the coverage is the ratio of the intersection of the two boxes to their union. To evaluate the tracking performance of the different algorithms over entire video sequences, the average center position deviation and the average coverage are used as indicators; the results are shown in Tables 3 and 4.
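Both indicators, together with the accuracy at a pixel threshold, can be computed as in the following sketch. Boxes are assumed to be `(x1, y1, x2, y2)` tuples and the function names are illustrative; the 20-pixel default matches the accuracy threshold stated earlier in this section.

```python
import math

def center_error(pred, truth):
    """Euclidean distance between the centers of two boxes."""
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    tcx, tcy = (truth[0] + truth[2]) / 2, (truth[1] + truth[3]) / 2
    return math.hypot(pcx - tcx, pcy - tcy)

def coverage(pred, truth):
    """Intersection area divided by union area of two boxes."""
    iw = max(0.0, min(pred[2], truth[2]) - max(pred[0], truth[0]))
    ih = max(0.0, min(pred[3], truth[3]) - max(pred[1], truth[1]))
    inter = iw * ih
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (truth[2] - truth[0]) * (truth[3] - truth[1]) - inter)
    return inter / union if union > 0 else 0.0

def accuracy(errors, threshold=20):
    """Fraction of frames whose center error is within the pixel threshold."""
    return sum(e <= threshold for e in errors) / len(errors)

err = center_error((0, 0, 10, 10), (3, 4, 13, 14))
cov = coverage((0, 0, 10, 10), (5, 0, 15, 10))
```

Averaging `center_error` and `coverage` over all frames of a sequence gives the two per-sequence indicators used in the comparison.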

Tables 3 and 4 show that, over the eight tracking video sequences, the proposed algorithm ranks best on three sequences and second on two for the average center position deviation, and best on two sequences and second on four for the average coverage. A scoring method is used to evaluate the two indicators separately. For each video sequence, the algorithms are ranked from best to worst on the indicator and awarded 6, 5, 4, 3, 2, and 1 points in turn. The points are then summed over all sequences to give the final score, as shown in Figure 1.
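The scoring rule can be sketched as follows. The metric values and algorithm labels in the example are made up for illustration and are not the paper's results; ties are not handled, and the function names are hypothetical.

```python
def rank_points(metric_by_algorithm, higher_is_better):
    """Award n, n-1, ..., 1 points to algorithms ranked from best to worst."""
    ordered = sorted(metric_by_algorithm, key=metric_by_algorithm.get,
                     reverse=higher_is_better)
    return dict(zip(ordered, range(len(ordered), 0, -1)))

def total_points(per_sequence, higher_is_better):
    """Sum the per-sequence points of each algorithm over all sequences."""
    totals = {}
    for metrics in per_sequence:
        for name, points in rank_points(metrics, higher_is_better).items():
            totals[name] = totals.get(name, 0) + points
    return totals

# Illustrative coverage values for three hypothetical algorithms on two sequences.
coverage_tables = [
    {"A": 0.9, "B": 0.5, "C": 0.7},
    {"A": 0.6, "B": 0.8, "C": 0.4},
]
totals = total_points(coverage_tables, higher_is_better=True)
```

For the center position deviation, `higher_is_better=False` gives the most points to the smallest error. With six algorithms and eight sequences, the maximum attainable total is 48 points.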

Figure 1 shows that the proposed algorithm scores 39 points on the average center position deviation and 40 points on the average coverage, both higher than the other algorithms. This indicates that it performs best among the listed tracking algorithms.

The frame rate is the number of frames processed and displayed per second and reflects how smooth the tracking results are. The average frame rates of the different algorithms on the eight video sequences are compared in Table 5.

Table 5 shows that the average frame rate of the proposed algorithm is higher than that of the other algorithms and stays above 35 Hz, so its tracking results are smoother.

To verify the efficiency of the algorithm, the number of iterations and the time consumption of the different algorithms are compared; the results are shown in Figure 2.

Figure 2: (a) number of iterations; (b) time consumption.

Figure 2 shows that the number of iterations of the proposed algorithm is similar to that of the CT algorithm, averaging 1–2 iterations, and is significantly lower than that of the CSK algorithm. In time consumption, the proposed algorithm is similar to the CSK algorithm, averaging 12.5 s, and is significantly lower than the CT algorithm. The comparison shows that the proposed algorithm tracks more efficiently.

To verify the stability of the proposed algorithm, three algorithms are used to track the target in the same experimental environment. The average outage probabilities of the algorithms are compared, and the results are shown in Table 6.

To show the stability of the algorithms more clearly, the data in Table 6 are also plotted as a line graph in Figure 3.

Table 6 and Figure 3 show that the average outage probability of the proposed algorithm is 0.2371, which is lower than that of the other two algorithms. The proposed algorithm is therefore more stable when tracking targets in football match video.

To test resource occupancy, the CPU and memory usage of the algorithms are compared across target detection, feature extraction, single-target tracking, and multi-target tracking. The results are shown in Table 7.

Table 7 shows that the CPU and memory usage of the proposed algorithm are 26%–34% and 7%–17%, respectively, compared with 64%–70% and 33%–41% for the CSK algorithm and 64%–72% and 29%–39% for the CT algorithm. The proposed algorithm therefore has a lower resource occupancy than the other two algorithms when tracking targets in football match video.

#### 4. Discussion

The accuracy and success rate of the algorithm reach 0.88 and 0.81, respectively; the average center position deviation score and the average coverage score are 39 and 40 points, respectively; the average frame rate stays above 35 Hz; and the tracking time is about 12.5 s. These data show that the proposed algorithm outperforms the other algorithms in target tracking. The reason is the use of deep learning. Deep learning uses artificial neural networks that imitate the way the human brain analyzes things. By imitating how the brain acquires and parses data, this structure learns the essential characteristics of objects better. The main idea of deep-learning-based target tracking is as follows: first, a deep learning model is constructed and trained on standard data sets to obtain more accurate target feature information; then, the model is used for target matching and localization to achieve efficient target tracking. Depth features reflect the appearance of moving objects more accurately than traditional features such as SIFT (scale-invariant feature transform) features, so the proposed algorithm greatly improves tracking accuracy. In addition, it combines the discriminative scale space algorithm and the Markov chain Monte Carlo algorithm to track targets in football video, which ensures efficient tracking of single targets while achieving accurate tracking of multiple targets.

#### 5. Conclusions

With the rapid development of computer technology, deep learning has become a powerful tool for video target tracking. Deep learning offers high precision, a wide range of applications, and strong stability. Traditional target tracking algorithms for football match video lose the target when it is severely occluded. To address this defect, a deep-learning-based target tracking algorithm for football match video is proposed, which combines deep learning with target tracking techniques. The experimental results show that the tracking success rate, required time, and average frame rate of the proposed algorithm are 0.81, 12.5 s, and 35 Hz, respectively. The average center position deviation score and the average coverage score are 39 and 40 points, respectively, and the resource occupancy rate is low. This shows that the algorithm tracks targets in football match video well. In future research, information technology models will be used to further improve the accuracy of target tracking in football match video and to further reduce target loss under severe occlusion [41–45].

#### Data Availability

The data sets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

#### Conflicts of Interest

The authors declare that this article is free of conflicts of interest.