Abstract

Deep learning has brought revolutionary progress to computer vision, and intelligent inspection equipment based on computer vision has developed rapidly as a result. However, because of the large number of deep feature channels, deep trackers are difficult to deploy on mobile devices at real-time tracking speed. This paper presents a target-aware deep feature compression method for power intelligent inspection tracking. First, a negative balance loss function is designed to mine channel features suited to the current inspection scene by shrinking the contribution of pure background negative samples and enhancing the impact of hard negative samples. The deep feature compression model is then combined with a Siamese tracking framework to achieve real-time and robust tracking. Finally, we evaluate the proposed method on real application scenarios and general datasets to demonstrate its practicability.

1. Introduction

A stable and reliable power system is key to ensuring people’s livelihood and economic development [1, 2]. In recent years, deep learning has been widely used in power systems owing to its strong capability, including voltage stability assessment [3], wind farm control [4], solar power generation system control [5], and intelligent power inspection [6–8]. In intelligent power inspection, vision-based inspection technology has attracted wide attention. Robot-based detection technology [6] and UAV-based detection technology [7, 8] both need to track a determined target, so a robust tracking method is of great importance. Deep learning provides strong support for the development of more robust target tracking [9, 10]. However, the deep features produced by deep neural network models have many parameters, so they are expensive in computation and storage, which makes it difficult to meet the real-time requirements of intelligent inspection and to deploy on mobile intelligent inspection devices.

Some methods aim to improve machine learning performance with optimization algorithms, so as to monitor the power system state more effectively and quickly [11]. In addition, many lightweight methods for deep networks have been proposed, falling mainly into five categories: (1) parameter pruning, (2) parameter sharing, (3) low-rank decomposition, (4) compact convolutional filter design, and (5) knowledge distillation. Parameter pruning removes redundant parameters by designing criteria that judge whether a parameter is important [12, 13]. Parameter sharing exploits the redundancy of model parameters and uses hashing or quantization techniques to compress weights [14, 15]. Low-rank decomposition uses matrix or tensor decomposition techniques to estimate and decompose the original convolution kernels in the deep model [16, 17]. Compact convolutional filter design reduces the storage and computational complexity of the model through specially structured convolution kernels or compact convolution units [18, 19]. Knowledge distillation transfers the knowledge of a large network into a compact distilled model [20, 21]. These methods compress the network as a whole, but tracking uses only the feature output of a certain layer. How to compress deep features is therefore very important for improving tracking speed.

To compress features, Li et al. [22] proposed a BinJaya-based feature selection method that selects the optimal feature subset from the whole feature space, which is composed of a set of systemic-level classification features extracted from PMU data. Efficient convolution operators for tracking (ECO) [23] uses principal component analysis to simplify features, but remains very slow because of online updates. Context-aware deep feature compression for high-speed visual tracking (TRACA) [24] combines a context-aware network with multiple expert auto-encoders to compress feature maps. However, its compression performance depends mainly on the pretrained context-aware network, which cannot adapt well to the current tracking scene. In contrast, target-aware deep tracking (TADT) [25] proposed a regression loss and a ranking loss. Guided by these two losses, it uses only the target information given in the first frame to effectively eliminate redundant channel features and achieve robust and high-speed tracking. However, in learning the regression loss, the large number of pure background negative samples used by TADT accounts for most of the contribution, so the activated channel features focus on the difference between the target and the pure background, whereas tracking requires more attention to distractors that are hard to distinguish from the target.

Inspired by the work above, we propose a target-aware deep feature compression method for target tracking in intelligent inspection. We design a negative balance loss function that shrinks the loss of pure background negative samples and prevents them from dominating the learning process. At the same time, the contribution of hard negative samples is enhanced, so that the activated channel features focus more on the difference between the target and similar distractors. The compressed features are then combined with the Siamese framework to achieve robust and real-time tracking.

2. Regression via Convolution Layer

We first review the learning process of convolutional regression networks. A convolutional regression network regresses training samples to soft labels, which are usually Gaussian-shaped. Given an initial image with annotations, a large number of training samples $X$ are obtained through dense sampling, together with the corresponding Gaussian soft labels $Y$. The weight $W$ of the regression network is estimated by solving the minimization problem

$$\arg\min_{W} \|W * X - Y\|^2 + \lambda \|W\|^2,$$

where $*$ is the convolution operation and $\lambda$ is the regularization parameter controlling overfitting. The network weight is usually optimized by minimizing the square loss:

$$L_2 = \|W * X - Y\|^2.$$

Samples for learning the convolutional network can be extracted by sliding a window over the input search area image. The search area must include a large context region around the target, because the surrounding background carries valuable information that helps the network distinguish the target from the background. However, this large context region consists mostly of pure background, with only a small number of target-like regions. The pure background areas produce many easy negative samples whose accumulated loss is large, so the loss of the few valuable positive samples is drowned out. Because the gradient of easy negative samples dominates, the learning process does not focus on the difference between the target and similar distractors; thus, the importance of each channel feature to the target representation cannot be assessed accurately.
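A minimal Python sketch of the regularized regression objective and of the imbalance described above follows. It assumes a single-channel feature map and a single kernel; the names `gaussian_label`, `regression_objective`, and `background_fraction`, the toy sizes, and the 0.01 cutoff are illustrative choices, not values from the experiments.

```python
import math

def gaussian_label(size, sigma):
    """Gaussian soft label on a size x size grid, peaking at the center."""
    c = (size - 1) / 2.0
    return [[math.exp(-((y - c) ** 2 + (x - c) ** 2) / (2.0 * sigma ** 2))
             for x in range(size)] for y in range(size)]

def correlate_valid(feature, kernel):
    """Slide the kernel over a single-channel map without padding.

    Deep learning 'convolution' layers compute this cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature) - kh + 1
    out_w = len(feature[0]) - kw + 1
    return [[sum(kernel[a][b] * feature[y + a][x + b]
                 for a in range(kh) for b in range(kw))
             for x in range(out_w)] for y in range(out_h)]

def regression_objective(feature, kernel, label, lam):
    """Squared fitting error plus lam times the squared norm of the kernel."""
    pred = correlate_valid(feature, kernel)
    fit = sum((p - t) ** 2
              for pred_row, label_row in zip(pred, label)
              for p, t in zip(pred_row, label_row))
    penalty = lam * sum(w ** 2 for row in kernel for w in row)
    return fit + penalty

def background_fraction(label, eps=0.01):
    """Share of positions whose soft label is below eps (near-pure background)."""
    values = [v for row in label for v in row]
    return sum(v < eps for v in values) / len(values)

# Toy data: a 21 x 21 label and a 23 x 23 single-channel feature map.
label = gaussian_label(21, sigma=2.0)
feature = [[0.0] * 23 for _ in range(23)]
feature[11][11] = 1.0
kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

objective = regression_objective(feature, kernel, label, lam=0.1)
bg_share = background_fraction(label)
```

In this toy setting roughly three quarters of the label positions are near zero, which illustrates why an unweighted square loss is dominated by easy background samples.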

3. Negative Balance Loss

One way to address the imbalance of training data is to design a loss that reduces the effect of easy negative samples. Chen et al. [26] proposed an automatic hard negative mining method, which eliminates easy negatives and enhances positives. Song et al. [27] used a cost-sensitive loss to reduce the influence of easy negatives on the learning process. Lu et al. [28] proposed the shrinkage loss, which penalizes the loss of easy samples while keeping that of hard samples unchanged. Bhat et al. [29] proposed a discriminative learning loss to enhance the contribution of positive samples. Although these methods can suppress the loss of easy negative samples to some extent, they may also reduce the loss of hard negative samples.

In this paper, we design a negative balance loss function that evaluates the contribution of each channel feature to the current target representation more effectively by shrinking the loss of easy negative samples and enhancing the contribution of hard negative samples. The negative balance loss adjusts the regression loss with a modulating function that has two hyperparameters. The first hyperparameter mainly controls the degree of loss compression: with strong compression the impact of easy negative samples is greatly reduced, but the loss of hard negative samples is reduced as well. The second hyperparameter addresses this problem; it is proportional to the degree of loss compression. With an appropriate combination of the two hyperparameters, the modulating function forms an S-shaped curve that reduces the loss of easy negative samples and enhances the loss of hard negative samples, as shown in Figure 1. The values of both hyperparameters are determined through analysis and experiments on the validation set.
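As a purely illustrative sketch, and not the formula of the proposed loss, the following Python code shows one way an S-shaped modulation can shrink small regression errors and amplify large ones. The function `modulation`, the parameters `a` (steepness) and `c` (crossover error), and their default values are hypothetical and are not the hyperparameters selected in this paper.

```python
import math

def modulation(err, a, c):
    """Hypothetical S-shaped weight in (0, 2).

    It approaches 0 for errors well below c (easy samples), equals 1 at
    err == c, and approaches 2 for errors well above c (hard samples).
    """
    return 2.0 / (1.0 + math.exp(a * (c - err)))

def modulated_loss(pred, target, a=10.0, c=0.2):
    """Squared error rescaled by the S-shaped modulation above."""
    err = abs(pred - target)
    return modulation(err, a, c) * err ** 2

# An easy negative (small error) and a hard negative (large error).
easy = modulated_loss(0.05, 0.0)
hard = modulated_loss(0.50, 0.0)
```

With these assumed values, the easy sample contributes less than its plain squared error and the hard sample contributes more, which is the qualitative behavior the S-shaped curve is meant to provide.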

4. Intelligent Inspection via Deep Feature Compression

Visual power inspection relies on the expressive power of image features [30, 31]. Neural networks have brought powerful deep features. Deep neural networks have tended toward larger and deeper models and, benefiting from large-scale training datasets such as ImageNet, their performance has greatly improved. However, neural networks place very high demands on memory and computing power, which has drawn increasing attention in academia. In practical applications, compressing a network while preserving its performance, so that it can run efficiently on intelligent inspection equipment, is an active research direction.

To solve the previously mentioned problems, we propose a target-aware channel pruning method. Our work is based on the following observation: the sensitivity of each convolution filter to the current object can be obtained from the gradient computed by back propagation, and the importance of each channel feature to the current target representation can then be determined using global average pooling. Based on this, we use the designed negative balance loss function to guide the assessment of the importance of each channel feature to the target representation. Given a search area image, a pretrained CNN is used to extract features, and the importance of each channel feature is calculated by the following formula:

$$\Delta_i = G_{AP}\left(\frac{\partial L_{NB}}{\partial z_i}\right),$$

where $i$ is the channel index, $z_i$ is the feature of the $i$-th channel, $G_{AP}$ is the global average pooling function, and $L_{NB}$ is the designed negative balance loss. To judge more effectively how much each channel contributes to the current target representation, we formulate the regression problem with the negative balance loss:

$$\min_{W} L_{NB}\left(W * X_{u,v},\, Y(u,v)\right) + \lambda \|W\|^2,$$

where $W$ is the regression weight, $*$ is the convolution operation, and $\lambda$ is the regularization parameter.

Here, we regress all samples $X_{u,v}$ aligned with the target center in the image to a soft label, where the soft label is a Gaussian label $Y(u,v) = e^{-\frac{u^2 + v^2}{2\sigma^2}}$, $(u,v)$ is the offset against the target, and $\sigma$ is the kernel width. The importance of each channel feature to the target representation can be calculated in terms of its contribution to fitting the label, that is, by taking the derivative of $L_{NB}$ with respect to the input feature $X_{in}$. Using the chain rule on this regression problem, the gradient of the regression loss is

$$\frac{\partial L_{NB}}{\partial X_{in}} = \sum_{u,v} \frac{\partial L_{NB}}{\partial X_o(u,v)} \times \frac{\partial X_o(u,v)}{\partial X_{in}(u,v)},$$

where $X_o$ is the output prediction. Finally, after obtaining the importance weight of each channel feature, we binarize the weights: channels whose contribution is higher than a certain threshold are retained, and all others are removed. The importance weight generation process for each channel is shown in Figure 2.
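The channel scoring and binarization steps described above can be sketched in Python as follows. In practice the per-channel gradient maps would come from back propagation through the regression layer; here they are supplied directly as toy nested lists, and the function names, the threshold, and the `keep_top_k` variant (which keeps a fixed number of channels instead of thresholding) are illustrative assumptions.

```python
def global_average_pool(channel_map):
    """Mean of all entries of one 2D gradient map."""
    values = [v for row in channel_map for v in row]
    return sum(values) / len(values)

def channel_importance(gradients):
    """One importance score per channel: the pooled gradient of the loss."""
    return [global_average_pool(g) for g in gradients]

def binarize(importance, threshold):
    """Keep (1) channels scoring above the threshold, drop (0) the rest."""
    return [1 if score > threshold else 0 for score in importance]

def keep_top_k(importance, k):
    """Alternative mask that keeps the k highest-scoring channels."""
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(importance))]

# Toy gradients for three channels, each a 2 x 2 map.
gradients = [
    [[0.4, 0.6], [0.5, 0.5]],    # strongly target-sensitive channel
    [[0.0, 0.1], [0.1, 0.0]],    # weakly sensitive channel
    [[-0.2, 0.2], [0.1, -0.1]],  # gradients cancel out
]
importance = channel_importance(gradients)
mask = binarize(importance, threshold=0.1)
top2 = keep_top_k(importance, k=2)
```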

5. Mobile Intelligent Inspection Tracking

After the intelligent inspection device determines the target to be tracked, the tracking process starts. First, the initial features extracted from the target and its context region are fed to a single-layer network for training. After the loss converges, the filter gradient of each channel is calculated. The contribution of each channel feature is then calculated by the method given in Section 4 to guide channel pruning. Given the initial feature space $\chi$, the feature space after compression is

$$\chi' = \varphi(\chi; \Delta),$$

where $\varphi$ is the channel pruning function and $\Delta$ is the binarized channel importance parameter. In the tracking stage, the initial target and the search area of the next frame are known, and the predicted target position in the coordinate system is calculated as

$$\hat{p}_t = \arg\max_{p} \; \chi'(x_1) \star \chi'(z_t),$$

where $\star$ denotes the correlation operation, $\chi'(x_1)$ is the target template feature, $z_t$ is the search region in the next frame, and $t$ is the frame index.
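A small Python sketch of these two steps, channel pruning followed by locating the peak of a correlation score map, is given below. It assumes features stored as lists of 2D channel maps and a plain sliding-window cross-correlation summed over channels; the function names and the toy data are hypothetical and only illustrate the flow.

```python
def prune_channels(features, mask):
    """Keep only the channels whose binarized importance is 1."""
    return [ch for ch, keep in zip(features, mask) if keep]

def cross_correlate(template, search):
    """Slide the multi-channel template over the search features.

    Returns a 2D score map; each score sums the products over all channels.
    """
    th, tw = len(template[0]), len(template[0][0])
    out_h = len(search[0]) - th + 1
    out_w = len(search[0][0]) - tw + 1
    return [[sum(t[a][b] * s[y + a][x + b]
                 for t, s in zip(template, search)
                 for a in range(th) for b in range(tw))
             for x in range(out_w)] for y in range(out_h)]

def locate(score_map):
    """Row and column of the maximum response on the score map."""
    _, y, x = max((v, y, x)
                  for y, row in enumerate(score_map)
                  for x, v in enumerate(row))
    return y, x

def block_map(size, top, left, value):
    """size x size map of zeros with a 2 x 2 block of `value`."""
    grid = [[0.0] * size for _ in range(size)]
    for dy in (0, 1):
        for dx in (0, 1):
            grid[top + dy][left + dx] = value
    return grid

# Channel 0 responds to the target at (1, 2); channel 1 responds to a distractor.
template = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
search = [block_map(4, 1, 2, 1.0), block_map(4, 0, 0, 5.0)]
mask = [1, 0]

unpruned_position = locate(cross_correlate(template, search))
pruned_position = locate(cross_correlate(prune_channels(template, mask),
                                         prune_channels(search, mask)))
```

In this toy case the distractor channel pulls the unpruned peak away from the target, while the pruned features recover the target position; it also shows that the cost of the correlation grows with the number of retained channels.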

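In addition, as the intelligent inspection device or camera moves, the scale of the target changes, so the tracking process needs to adapt to this change. To better handle scale variation, we add a scale-sensitive pruning method based on a ranking loss to the target-aware channel pruning. Following TADT [25], we use a smooth approximation of the ranking loss defined by

$$L_{rank} = \log\left(1 + \sum_{(x_i, x_j) \in \Omega} \exp\left(f(x_i) - f(x_j)\right)\right),$$

where $f(\cdot)$ is the prediction model and $\Omega$ is the set of training pairs in which the size of $x_j$ is closer to the target size than that of $x_i$.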
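A minimal Python sketch of a smooth ranking loss is shown below. It assumes the log-sum-exp form used in TADT [25], in which a pair (i, j) means that sample j, whose size is closer to the target size, should score higher than sample i; the function name and the toy scores are illustrative.

```python
import math

def smooth_ranking_loss(scores, pairs):
    """log(1 + sum of exp(score_i - score_j)) over pairs where j should win."""
    return math.log(1.0 + sum(math.exp(scores[i] - scores[j])
                              for i, j in pairs))

# Sample 1 has the size closest to the target, so it should outrank 0 and 2.
pairs = [(0, 1), (2, 1)]
well_ranked = smooth_ranking_loss([0.2, 0.9, 0.4], pairs)
badly_ranked = smooth_ranking_loss([0.9, 0.2, 0.4], pairs)
```

The loss is small when the correctly scaled sample already has the highest score and grows when the ordering is violated, so its gradient highlights channels that are sensitive to scale.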

6. Results and Discussion

6.1. Experimental Setup

To verify the effectiveness of the proposed method, we first conducted experiments on the general tracking datasets TC-128 [32], OTB100 [33], and UAV123 [34], and then on the actual inspection videos we collected. The experiments were carried out on a PC with 316G memory, an i7 3.00 GHz CPU, and a GTX-2060 GPU, using the MatConvNet toolbox [35] to implement the proposed method in MATLAB 2019a. VGG-16 [36] was chosen as the backbone network of the tracker. To obtain better accuracy and scale adaptation, the activation outputs of the Conv4-3 and Conv4-1 layers were used as the base deep features. After calculating the importance of each channel feature, the 100 most important channel features of the Conv4-3 layer and the 30 most important channel features of the Conv4-1 layer were retained. In the tracking stage, the given target is used as a template, a search area image three times the size of the target is cropped from the current frame, and a feature pyramid is formed at 0.990, 1, and 1.005 times the target scale.

6.2. Overall Performance on Common Evaluation Benchmark

We first evaluated the overall performance of the proposed method on two general tracking benchmarks (TC-128 and UAV123) and compared the precision and success plots of the tracking results under one-pass evaluation (OPE). Precision plot: first compute the distance between the center of the predicted bounding box and the center of the manually annotated ground truth; the precision is the percentage of video frames in which this distance is less than a given threshold. The curve of precision over different thresholds is the precision plot, and the precision at a threshold of 20 pixels is the precision score (DP). Success plot: first define the overlap score (OS). Let $a$ be the bounding box produced by the tracking method and $b$ the ground-truth box. The overlap score is defined as $OS = |a \cap b| / |a \cup b|$, where $|\cdot|$ is the number of pixels in a region. When the OS of a frame is greater than a set threshold, the frame is regarded as a success, and the percentage of successful frames among all frames is the success rate. The curve drawn over the value range of OS is the success plot, and the area under the success plot is the AUC score.
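The two metrics can be sketched in Python as follows. The sketch assumes boxes given as (x, y, width, height) tuples and approximates the AUC by averaging the success rate over 21 evenly spaced overlap thresholds, a common convention that is an assumption here rather than a detail stated above; the function names and toy boxes are illustrative.

```python
import math

def center_error(a, b):
    """Distance in pixels between the centers of two (x, y, w, h) boxes."""
    return math.hypot((a[0] + a[2] / 2.0) - (b[0] + b[2] / 2.0),
                      (a[1] + a[3] / 2.0) - (b[1] + b[3] / 2.0))

def overlap_score(a, b):
    """Intersection area divided by union area of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def precision(pred, truth, threshold=20.0):
    """Share of frames whose center error is below the pixel threshold."""
    errors = [center_error(p, g) for p, g in zip(pred, truth)]
    return sum(e < threshold for e in errors) / len(errors)

def success_rate(pred, truth, threshold):
    """Share of frames whose overlap score exceeds the threshold."""
    overlaps = [overlap_score(p, g) for p, g in zip(pred, truth)]
    return sum(o > threshold for o in overlaps) / len(overlaps)

def success_auc(pred, truth, steps=21):
    """Mean success rate over evenly spaced thresholds in [0, 1]."""
    thresholds = [k / (steps - 1) for k in range(steps)]
    return sum(success_rate(pred, truth, t) for t in thresholds) / steps

# Three frames: a perfect hit, a partial overlap, and a complete miss.
pred = [(0, 0, 10, 10), (0, 0, 10, 10), (100, 100, 10, 10)]
truth = [(0, 0, 10, 10), (5, 0, 10, 10), (0, 0, 10, 10)]
dp = precision(pred, truth)
auc = success_auc(pred, truth)
```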

6.2.1. Overall Performance on the TC-128 Dataset

We mainly compared the proposed method with correlation filter-based trackers, including exploiting the circulant structure of tracking-by-detection with kernels (CSK) [37], accurate scale estimation for robust visual tracking (DSST) [38], and high-speed tracking with kernelized correlation filters (KCF) [39]; deep feature-based methods, including hierarchical convolutional features for visual tracking (CF2) [40], ECO [23], and TADT [25]; and other trackers, including convolutional regression for visual tracking (CREST) [41]. The tracking results on the TC-128 dataset are shown in Figure 3. The comparison shows that deep feature-based methods are significantly better than hand-crafted feature-based methods. Our method also holds an advantage over the other deep feature-based methods. It is worth noting that, compared with the traditional ridge regression used by TADT, the proposed method effectively addresses the imbalance between easy and hard negative samples, thus obtaining more robust and accurate results.

6.2.2. Overall Performance on the UAV123 Dataset

All the video data in the UAV123 dataset were captured by unmanned aerial vehicles (UAVs), which are widely used in intelligent inspection. Given this similarity, we further verify the effectiveness of the proposed method on the UAV123 dataset. The overall tracking results are shown in Figure 4. Even the ECO method, which is based on deep features, holds up well on video taken by specialized unmanned aerial vehicles.

6.3. Actual Inspection Performance

We collected several videos of actual inspection scenes and built a small dataset for the experiment. The same evaluation metrics as for TC-128 were used to evaluate the overall performance of the proposed method. The precision plot and success plot are shown in Figure 5. We compared our method with two classes of classical methods: the correlation filter-based trackers KCF, CSK, and ECO, and the Siamese framework-based trackers TADT and high performance visual tracking with Siamese region proposal network (SiamRPN) [42]. It is worth noting that SiamRPN combines detection and tracking, and our method is an improvement on TADT. The correlation filter-based trackers KCF and CSK use hand-crafted features, while ECO uses deep features. The correlation filter method using deep features is clearly better, which indicates that deep learning can effectively improve the robustness of tracking. Thanks to more deep features and online updates, ECO performs better, but its speed is only 3 FPS, which cannot meet the real-time requirements of intelligent inspection tracking. The proposed method achieves 30 FPS with accuracy close to that of ECO. SiamRPN regards tracking as a one-shot detection problem, but because of the complexity of inspection scenarios, detecting in every frame is not suitable, so its performance is limited.

Figure 6 shows the tracking results of different tracking methods in four real scenarios. As can be seen from the first column, the detection-based tracking method SiamRPN is not suitable for this tracking task. The second column shows that the correlation filter-based trackers CSK and KCF both adapt poorly to target scale change and struggle to achieve high tracking accuracy, whereas the deep feature-based tracking methods have higher accuracy. Similarly, the third and fourth columns also show that the deep feature-based tracking methods perform better.

6.4. Analysis of Tracking Speed

In addition to tracking accuracy, tracking speed is another important criterion, because actual tracking scenes require real-time speed. We compared the scores and tracking speeds of different algorithms on several videos; the results are shown in Table 1. CSK and KCF, which are correlation filters based on hand-crafted features, reach up to 2118 FPS and 969 FPS, respectively, but their tracking performance is limited. The deep feature-based ECO performs better, but runs at less than 3 FPS. The Siamese network-based tracker TADT achieves real-time speed and high performance. Our method improves on TADT: the tracking accuracy rises to a level close to that of ECO, and the speed remains real-time.

6.5. Analysis of Compression Effect

In order to prove the performance of the proposed target-aware deep feature compression method more clearly, we compared the number of parameters of the original deep feature and the compressed deep feature, the floating point operations during correlation operation, the number of channels, the DP and AUC scores on TC-128 dataset, and the tracking speed, as shown in Table 2.

As can be seen from Table 2, the original deep feature has 1024 channels, so the number of parameters is very large, and the floating point operations are very large during the correlation operation, which leads to the inability to realize real-time tracking. After compression using the proposed method, the numbers of parameters and floating point operations are greatly reduced, only 12.7% of the original, so the tracking speed is greatly improved, and real-time tracking is realized.
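The 12.7% figure is consistent with the channel counts given in the experimental setup, as the short Python check below shows. It assumes that the 1024 original channels are the 512 channels of each of the two VGG-16 layers concatenated, and that the parameters and floating point operations of the correlation scale linearly with the number of channels; the variable names are illustrative.

```python
# Channels retained per layer, as stated in the experimental setup.
RETAINED = {"Conv4-3": 100, "Conv4-1": 30}
ORIGINAL_CHANNELS = 1024  # assumed: 512 channels from each of the two layers

retained_total = sum(RETAINED.values())
ratio = retained_total / ORIGINAL_CHANNELS

print(f"{retained_total} of {ORIGINAL_CHANNELS} channels kept: {ratio:.1%}")
# prints "130 of 1024 channels kept: 12.7%"
```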

From the comparison of tracking results, tracking does not benefit from the original deep feature with more channels. After compression, the reduction in feature channels does not lower the tracking score but raises it. This shows that the proposed method can effectively compress deep features to achieve robust tracking.

7. Applicable Environment and Restrictions

In the intelligent inspection of power systems based on a mobile robot, the task of the robot is to read the indicator values of instrument panels using visual inspection technology. The whole task is divided into three parts. (1) The first part uses detection technology to identify the instrument that needs to be read. (2) The second part uses tracking technology to continuously track the instrument until the instrument picture is clear. (3) The third part uses a segmentation method to read the indicator value of the instrument. The tracking method we propose is mainly used for continuous instrument tracking in the second part. After the instrument to be tracked is detected, the importance weight of each channel is calculated using the proposed compression method (that is, to determine which channel features are retained). Then, the importance weights are used to prune the original deep feature during tracking. Finally, the pruned deep features are used in correlation operations to obtain the score map, and the position of the maximum value on the score map is the predicted target position.

As can be seen from the experimental results above, whether on the general tracking dataset TC-128, the scenario-specific dataset UAV123, or actual inspection scenarios, our method has excellent performance in tracking robustness and tracking speed, indicating that it has good scalability.

The limitation of this method is that the overall task is divided into three visual tasks that are processed separately, so errors can arise in the communication between the three parts, and their coordination needs to be carefully tuned. However, the proposed feature compression method has good transferability and can be applied to other tasks. In the future, we hope to integrate detection, tracking, and segmentation into a single network, so as to achieve more efficient intelligent inspection tasks.

8. Conclusion

Accurate and robust tracking is very important in intelligent power inspection. The development of deep learning has improved tracking performance. However, tracking efficiency is very low because of the high dimensionality of deep features, so deep trackers cannot be applied directly to intelligent power inspection. In this paper, a target-aware deep feature compression method is proposed for the power intelligent inspection system. Through the designed negative balance loss, the method retains important channel features more effectively and removes unnecessary ones. The compressed features are combined with the Siamese framework to realize tracking, and the effectiveness of the proposed method is demonstrated by experiments on two general tracking datasets and power inspection videos.

Data Availability

The data used to support the findings of the study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61873246, 62072416, 62006213, 6167241, and 61702462), Program for Science and Technology Innovation Talents in Universities of Henan Province (21HASTIT028), Natural Science Foundation of Henan (202300410495), and Zhongyuan Science and Technology Innovation Leadership Program (214200510026).