Traditional image object detection algorithms applied to power inspection cannot position power components effectively, and their recognition accuracy is low in scenes with interference. In this research, we propose a data-driven power detection method based on an improved YOLOv4-tiny model, which combines the ResNet-D module and an adjusted Res-CBAM in the backbone network of the existing YOLOv4-tiny model. We replace the CSPOSANet module in the YOLOv4-tiny backbone network with the ResNet-D module to reduce the FLOPs required by the model. At the same time, the adjusted Res-CBAM, whose feature fusion is changed to stacking along the channels, is added as an auxiliary classifier. Finally, the features of five different receptive scales are used for prediction, and the display of the results is optimized by merging the prediction boxes. In the experiment, 57134 images collected on power inspection lines were processed and labeled, the default anchor boxes were re-clustered, and the speed and accuracy of the model were evaluated on video and on a validation set of 3459 images. The results show that, compared with the original YOLOv4-tiny model, our method maintains accuracy, can position objects under occlusion and complex lighting conditions, and is about 13% faster in detection.

1. Introduction

As infrastructure on which the national economy and people’s livelihood depend, the power system is very important to modern society [1, 2]. It is therefore essential to monitor whether power components work safely and reliably. The traditional manual detection method requires people to work for long periods, and its effectiveness depends on the experience and working condition of the staff. Manual inspection cannot provide continuous monitoring and is less reliable. To maintain the normal operation of power equipment and improve the safety and reliability of the power system, detection algorithms have begun to be applied in the power industry [3, 9].

In power inspection, image backgrounds are complex and the timeliness requirements are high. Traditional target detection algorithms mainly use manually designed features and classifiers, such as the Haar classifier [10, 11], cascaded classifier [12], SIFT (scale-invariant feature transform) [3, 4, 13–15], HOG (histogram of oriented gradients) [16], DPM (deformable part model) [17], SVM (support vector machine) [18], and so on, or combine them [19, 20]. The disadvantages of these methods [3–20] are as follows: (1) Their feature extractors are manually selected and are not robust to changes in complex practical application scenarios. (2) Their region selection strategies are based on sliding windows, with high time complexity and window redundancy, which affects the accuracy and speed of detection.

With the rapid development of neural networks and artificial intelligence technology, a large number of deep learning-based object detection algorithms have been applied to power inspection [5–9]. These methods can be roughly divided into two categories. The first is two-stage methods, which search for candidate boxes first and then perform classification and regression; their accuracy is higher but their speed is lower. Examples are the R-CNN series of algorithms based on region proposals (R-CNN [21], fast R-CNN [22], and faster R-CNN [23]). The second is one-stage methods, which use only one CNN to predict the categories and positions of different targets, such as SSD [24], RetinaNet [25], RefineDet [26], and the YOLO (You Only Look Once) [27] series discussed later. Although this type of method is lower in accuracy, it can achieve real-time detection.

There is an inherent trade-off between higher accuracy and faster detection speed. Beyond the intrinsic difficulty of object detection, real-time UAV power inspection is challenging owing to the interconnection and overlapping of power components, illumination changes, complex backgrounds, and the diversity and randomness of the relative position and motion between target and camera.

Aiming at the real-time detection requirements of actual power project scenes, we propose a data-driven power detection method based on the improved YOLOv4-tiny. In detail, the Res-CBAM block was reconstructed and the short-cut in ResNet-D block was adjusted, and then the two blocks were combined in the backbone of YOLOv4-tiny. The results show that our method can adapt to scenes with multiple obstructions and is faster while ensuring accuracy.

2. Materials and Methods

2.1. UAV Image Dataset

The dataset in this paper was collected from UAV images of power inspection and includes 53675 images in the training set and 3459 images in the validation set. The power inspection images and videos were taken by the UAV at high resolution, with diverse random angles and fast video transitions. We labeled the key connection parts in the dataset, which fall into three categories: wire connecting tower (dx_gt), tower connecting insulator (gt_jyz), and insulator connecting wire (jyz_dx). When labeling, the label box should cover as much of the target area as possible while including as little background as possible (Figure 1).

2.2. YOLO

YOLO was proposed by Redmon et al. [27]. It is a one-stage real-time object detection algorithm. Compared with two-stage models such as the R-CNN series, the YOLO series simplifies the object detection task into a single task of box regression and class prediction, so the position and category of the object boxes are inferred directly from the input. Because it dispenses with the region proposal stage, the YOLO series is faster than two-stage models and can achieve real-time detection, but it also sacrifices some accuracy and some ability to recognize small objects.

The inspection process of the YOLO series models is described in Figure 2. In the training phase, the image is divided into a grid, and each grid cell has several default anchor boxes. Among the anchor boxes of the cell that contains the center of the real box, the one that best matches the real box (the one with the largest IOU) is responsible for detecting it and for learning its information, that is, position information, confidence probability, and classification information. When the image is input into the YOLO model, the positioning loss $L_{\mathrm{box}}$, the object confidence loss $L_{\mathrm{obj}}$, and the classification loss $L_{\mathrm{cls}}$ can be calculated, so the total loss function is $L = \lambda_{1} L_{\mathrm{box}} + \lambda_{2} L_{\mathrm{obj}} + \lambda_{3} L_{\mathrm{cls}}$, where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the weights. Finally, the model parameters are updated through backpropagation. In the prediction phase, the parameters of the trained model are fixed, and the categories and confidence levels of the candidate prediction boxes are obtained after the image is input to the model. Finally, the results are obtained through nonmaximum suppression (NMS).
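The matching criterion (IOU), the weighted total loss, and the final NMS step described above can be illustrated with a minimal sketch. It assumes boxes are given as corner coordinates `(x1, y1, x2, y2)`; the function names, the unit loss weights, and the 0.45 suppression threshold are illustrative assumptions and are not taken from this paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def total_loss(l_box, l_obj, l_cls, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of positioning, confidence, and classification losses.

    The default weights are placeholders, not the values used in the paper.
    """
    return weights[0] * l_box + weights[1] * l_obj + weights[2] * l_cls


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

For example, with boxes `(0, 0, 10, 10)`, `(1, 1, 11, 11)`, and `(20, 20, 30, 30)` scored 0.9, 0.8, and 0.7, the first two overlap with an IOU of 81/119 (about 0.68), so the sketch keeps indices 0 and 2.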

YOLOv4 [28] is an improved version of YOLOv3 [29] in the YOLO series. It adds many small tricks to its predecessor, such as CutMix, mosaic, DropBlock regularization, SPP, and so on. YOLOv4-tiny [30] is in turn a simplified version of YOLOv4. The number of feature maps used for classification and regression is reduced to two, and the number of parameters is reduced from 60 million to 6 million by applying CSPOSANet (CSP), which is similar to the ResNet module. The model structure for an input size of 608 × 608 × 3 is shown in Figure 3.

2.3. Improved YOLOv4-Tiny
2.3.1. ResNet-D

ResNet-D [31] is a modification of the ResNet [32] architecture, shown in Figure 4. The motivation is that, in the unmodified ResNet, the 1 × 1 convolution with stride 2 used for downsampling ignores 3/4 of the input. ResNet-D therefore moves the downsampling to the following 3 × 3 convolution in one path and to an average pooling layer in the short-cut path, which avoids the loss of features caused by combining a 1 × 1 convolution with stride 2.
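The "3/4 of the input" figure can be checked by counting which input positions a strided window actually reads. The following sketch is only an illustration of that arithmetic; the helper name and the 8 × 8 test size are assumptions, and no padding is modeled.

```python
def fraction_read(height, width, kernel, stride):
    """Fraction of input positions read by a kernel x kernel window
    sliding with the given stride over a height x width map (no padding)."""
    touched = set()
    for top in range(0, height - kernel + 1, stride):
        for left in range(0, width - kernel + 1, stride):
            for dy in range(kernel):
                for dx in range(kernel):
                    touched.add((top + dy, left + dx))
    return len(touched) / (height * width)


# A 1 x 1 window with stride 2 reads one position out of every 2 x 2 patch.
one_by_one = fraction_read(8, 8, kernel=1, stride=2)
# A 2 x 2 pooling window with stride 2 reads every position.
pooling = fraction_read(8, 8, kernel=2, stride=2)
```

Here `one_by_one` evaluates to 0.25, so three quarters of the positions are never read, while `pooling` evaluates to 1.0.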

FLOPs [33] (floating-point operations) are used here to measure the computational cost of a module; they should not be confused with FLOPS (floating-point operations per second), which measures hardware performance. For an input feature map of 152 × 152 × 64, the FLOPs required by the ResNet-D and CSPOSANet modules are shown in Table 1.

Table 1 shows that the FLOPs required by the ResNet-D module are about one-tenth of those required by CSPOSANet. In this article, we therefore replace the CSPOSANet module with ResNet-D to accelerate the model.
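As a rough guide to how such operation counts arise, the cost of a single convolution layer can be estimated from its output size, channel counts, and kernel size. The sketch below counts multiply-accumulate operations (some authors report twice this number as FLOPs); the helper name and the two example layers are assumptions for illustration and do not reproduce the module structures or the values in Table 1.

```python
def conv_macs(out_h, out_w, in_ch, out_ch, kernel):
    """Multiply-accumulate count of one convolution layer (bias ignored)."""
    return out_h * out_w * in_ch * out_ch * kernel * kernel


# Hypothetical layers on a 152 x 152 x 64 feature map with stride 1.
macs_3x3 = conv_macs(152, 152, 64, 64, kernel=3)
macs_1x1 = conv_macs(152, 152, 64, 64, kernel=1)
ratio = macs_3x3 / macs_1x1  # a 3 x 3 layer costs 9 times a 1 x 1 layer
```

Summing such per-layer counts over all layers of a module gives the kind of total that is compared between modules.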

Note that the short-cut in the original ResNet-D performs element-wise addition, whereas in this paper it stacks the two paths along the channel dimension; that is, when the input has n channels, the output has 2n channels.
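The difference between the two short-cut styles is easiest to see in the tensor shapes. The sketch below uses NumPy arrays in channel-first layout `(C, H, W)`; the function names and the example sizes are assumptions for illustration only.

```python
import numpy as np


def shortcut_add(x, fx):
    """Original residual short-cut: element-wise addition, channels unchanged."""
    return x + fx


def shortcut_stack(x, fx):
    """Short-cut as stacking along the channel axis: n channels become 2n."""
    return np.concatenate([x, fx], axis=0)


x = np.ones((64, 19, 19))   # block input with n = 64 channels
fx = np.ones((64, 19, 19))  # output of the convolution path
added = shortcut_add(x, fx)      # shape (64, 19, 19)
stacked = shortcut_stack(x, fx)  # shape (128, 19, 19)
```

Stacking keeps both feature sets separate for the next layer, at the cost of doubling the channel count that the following convolution has to process.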

2.3.2. Res-CBAM

Woo et al. [34] proposed CBAM (convolutional block attention module) (Figure 5). The channel attention module calculates a channel weight of size (channel, 1, 1); that is, the height × width values in one channel are multiplied by the same weight, so the module mainly focuses on the information in the different input channels. The spatial attention module calculates a spatial weight of size (1, height, width), which means that the values of different channels at the same pixel position are multiplied by the same weight, so the module mainly focuses on the information at different input locations. The specific calculation process is

$$F' = M_{c}(F) \otimes F, \qquad F'' = M_{s}(F') \otimes F', \tag{1}$$

where $F$ and $F'$ are the inputs of the channel attention module and spatial attention module, respectively, and $F'$ and $F''$ are the results of the channel attention module and spatial attention module, respectively. $M_{c}$ and $M_{s}$ represent channel attention and spatial attention, respectively, and $\otimes$ denotes element-wise multiplication with broadcasting.
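The weight shapes and the order of the two attention steps can be made concrete with a small NumPy sketch of the original CBAM, not of the adjusted Res-CBAM. It follows the usual description of CBAM (a shared two-layer MLP over average- and max-pooled channel descriptors, then a convolution over the channel-wise average and max maps); the function names, the random weights, and the tensor sizes are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(f, w1, w2):
    """Channel weights of shape (C, 1, 1) for a feature map f of shape (C, H, W)."""
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # shared MLP with ReLU
    avg = f.mean(axis=(1, 2))
    mx = f.max(axis=(1, 2))
    return sigmoid(mlp(avg) + mlp(mx)).reshape(-1, 1, 1)


def spatial_attention(f, kernel):
    """Spatial weights of shape (1, H, W); kernel has shape (2, k, k), k odd."""
    pooled = np.stack([f.mean(axis=0), f.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = f.shape[1:]
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = np.sum(padded[:, y:y + k, x:x + k] * kernel)
    return sigmoid(out)[None]


def cbam(f, w1, w2, kernel):
    """Channel attention first, then spatial attention, both by broadcasting."""
    f1 = channel_attention(f, w1, w2) * f
    return spatial_attention(f1, kernel) * f1


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 5, 5))
refined = cbam(features, rng.normal(size=(2, 8)), rng.normal(size=(8, 2)),
               rng.normal(size=(2, 7, 7)))
```

The output `refined` has the same shape as `features`, because both weights are broadcast over the dimensions they do not index.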

We designed Res-CBAM based on CBAM by replacing the channel attention module with an upsampled average pooling layer and by changing the feature fusion method from element-wise multiplication to stacking along the channels (Figure 6).

To combine the features extracted by the adjusted Res-CBAM and to prevent the gradient explosion caused by an overly deep network module, two convolution operations and one convolution operation are added before and after the module, respectively, and a residual connection is used to join them.

2.3.3. Improved Model

In this article, we replace the CSP module in the original YOLOv4-tiny with the ResNet-D module to speed up the model. However, the ResNet-D module is small, which may reduce accuracy. To compensate, the designed Res-CBAM is incorporated into the backbone network to assist in detection. The proposed backbone and model structure are shown in Figures 7 and 8.

2.4. Processing of Inputs and Outputs
2.4.1. Processing of Inputs

Taking into account the I/O limitation during training, the pictures are preprocessed in advance. We crop each image based on the object category and scale it to reduce the size of the input image:

$$w' = w \cdot \frac{l_{\max}}{\max(w, h)}, \qquad h' = h \cdot \frac{l_{\max}}{\max(w, h)}, \tag{2}$$

where the maximum side length is $l_{\max}$, the width and height of the original image are $w$ and $h$, and the width and height of the zoomed image are $w'$ and $h'$.
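A minimal sketch of this kind of scaling is given below. It assumes that the image is scaled uniformly so that its longer side equals the maximum side length, which preserves the aspect ratio; that rule, the function name, and the value 608 in the example are assumptions and may differ from the preprocessing actually used.

```python
def scale_to_max_side(width, height, max_side):
    """Scale (width, height) uniformly so the longer side equals max_side."""
    scale = max_side / max(width, height)
    return round(width * scale), round(height * scale)


# A 1080P frame scaled to a hypothetical maximum side length of 608.
new_width, new_height = scale_to_max_side(1920, 1080, 608)
```

With these inputs the frame becomes 608 × 342, so the aspect ratio of 16:9 is kept while far fewer pixels have to be read during training.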

2.4.2. Processing of Outputs

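Although the output of the model has been processed by nonmaximum suppression (NMS) to reduce redundant results, in actual scene tests several overlapping detection boxes were still output, which interferes with normal inspection work. Therefore, we set the rule that for any two prediction boxes $B_i$ and $B_j$ ($i \neq j$) in the same category $c$, when the overlap (IOU) between them is greater than or equal to 0.5, the overlap of the two detection boxes is considered too large, so the two boxes are merged into a larger one:

$$B = \bigl(\min(x_{1}^{i}, x_{1}^{j}),\ \min(y_{1}^{i}, y_{1}^{j}),\ \max(x_{2}^{i}, x_{2}^{j}),\ \max(y_{2}^{i}, y_{2}^{j})\bigr) \quad \text{if } \mathrm{IOU}(B_i, B_j) \ge 0.5, \tag{3}$$

where $(x_{1}, y_{1})$ and $(x_{2}, y_{2})$ are the top-left and bottom-right corners of a box and

$$\mathrm{IOU}(B_i, B_j) = \frac{|B_i \cap B_j|}{|B_i \cup B_j|}.$$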
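The merging rule can be sketched as follows for the boxes of a single category. The 0.5 threshold comes from the text; treating the "larger one" as the smallest box that encloses both boxes, repeating the merge until no pair overlaps enough, and the function names are assumptions made for this illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def find_overlapping_pair(boxes, iou_threshold):
    """Return indices of the first pair with IOU >= threshold, or None."""
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) >= iou_threshold:
                return i, j
    return None


def merge_overlapping(boxes, iou_threshold=0.5):
    """Merge same-category boxes pairwise until no pair overlaps too much."""
    boxes = [tuple(b) for b in boxes]
    pair = find_overlapping_pair(boxes, iou_threshold)
    while pair is not None:
        a, b = boxes[pair[0]], boxes[pair[1]]
        enclosing = (min(a[0], b[0]), min(a[1], b[1]),
                     max(a[2], b[2]), max(a[3], b[3]))
        boxes = [bx for k, bx in enumerate(boxes) if k not in pair]
        boxes.append(enclosing)
        pair = find_overlapping_pair(boxes, iou_threshold)
    return boxes
```

For the boxes `(0, 0, 10, 10)`, `(1, 1, 11, 11)`, and `(20, 20, 30, 30)`, the first two have an IOU of about 0.68 and are replaced by `(0, 0, 11, 11)`, while the third box is left unchanged.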

3. Experiment

3.1. k-Means Clustering

YOLO’s default anchor boxes are empirical values obtained through k-means clustering on the COCO dataset of 80 categories. However, applying these default values to datasets collected in specific projects may affect the convergence speed and accuracy of the model. Therefore, it is necessary to re-cluster the object boxes of the dataset. Features of 19 × 19 and 38 × 38 are obtained when the input size is 608 × 608, and each scale corresponds to three anchor boxes, for a total of 6 anchors (Table 2). The 19 × 19 feature corresponds to a larger receptive field, i.e., the large-scale anchor boxes, while the 38 × 38 feature corresponds to a smaller receptive field, i.e., the small-scale anchor boxes. In this paper, the 19 × 19 feature corresponds to anchor boxes 3, 4, and 5, and the 38 × 38 feature corresponds to anchor boxes 1, 2, and 3.
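A common way to re-cluster anchors is k-means on the labeled box sizes, with IOU between (width, height) pairs as the similarity measure. The sketch below is one such variant and is not necessarily the procedure used here: the deterministic initialization, the use of cluster means as centers, and all names are assumptions.

```python
def wh_iou(a, b):
    """IOU of two boxes given as (width, height) and aligned at one corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(sizes, k, iterations=100):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    ordered = sorted(sizes, key=lambda s: s[0] * s[1])
    step = len(ordered) / k
    # Deterministic start: evenly spaced boxes from the area-sorted list.
    centers = [ordered[int((i + 0.5) * step)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for s in sizes:
            best = max(range(k), key=lambda c: wh_iou(s, centers[c]))
            clusters[best].append(s)
        new_centers = [
            (sum(w for w, _ in cl) / len(cl), sum(h for _, h in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])


# Toy label sizes in pixels (hypothetical), clustered into two anchors.
toy_sizes = [(10, 10), (12, 11), (11, 12), (100, 100), (110, 90), (90, 110)]
toy_anchors = kmeans_anchors(toy_sizes, k=2)
```

On this toy input the sketch returns the anchors `(11.0, 11.0)` and `(100.0, 100.0)`. For the setting described above, `k` would be 6, and the anchors sorted by area would then be assigned to the two feature scales.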

3.2. Network Training

The training environment in this article is shown in Table 3. The learning rate strategy is shown in Figure 9, and the learning rate at each iteration can be expressed as follows: where the basic learning rate is , the initial learning rate change node is , the initial learning rate change coefficient is n (n = 4), the maximum number of iterations is , and the learning rate change nodes are , ().
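One schedule that matches this description is a polynomial warm-up with exponent n up to the first change node, followed by step decay at the later change nodes. The sketch below assumes that form; apart from n = 4, every value in it (base learning rate, node positions, and the decay factor 0.1) is a hypothetical placeholder rather than a setting from this paper.

```python
def learning_rate(iteration, base_lr, warmup_end, power, steps, scale=0.1):
    """Warm-up then step-decay schedule (assumed form, see the text above).

    Before warmup_end the rate grows as base_lr * (iteration / warmup_end) ** power;
    afterwards it is multiplied by scale at every step that has been passed.
    """
    if iteration < warmup_end:
        return base_lr * (iteration / warmup_end) ** power
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= scale
    return lr


# Hypothetical settings: only power = 4 is taken from the text.
schedule = [learning_rate(t, base_lr=0.001, warmup_end=1000, power=4,
                          steps=[8000, 9000]) for t in (500, 1000, 8500, 9500)]
```

With these placeholder settings the rate is 6.25e-5 halfway through the warm-up, reaches the base rate of 0.001 at the end of the warm-up, and drops by a factor of 10 after each later node.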

Online data augmentation is used uniformly in training, with the mosaic and Gaussian_noise strategies. The mosaic strategy combines multiple images into one to enrich the object information and the backgrounds against which objects are detected, and Gaussian_noise adds Gaussian noise to the image.

4. Results and Discussion

We used mAP (mean average precision), AvgFPS (average frames per second), and GPU (graphics processing unit) memory occupancy as indicators to evaluate the model. Here, mAP is the mean of the average precision over all categories detected by the model, AvgFPS is the average number of frames that can be processed per second, and GPU memory occupancy is the amount of GPU memory occupied by the model when it is running. In this article, the YOLOv4-tiny model was compared with our method, and we tested the performance of the models trained with the default anchors and with the re-clustered anchors (Table 4).

Table 4 shows the following:
(1) For both the model we proposed and YOLOv4-tiny, the accuracy of the model increased by about 10% with the re-clustered anchors.
(2) The accuracy of YOLOv4-tiny is higher, and the AvgFPS of our method is higher. The accuracy of our method is about 1% lower than that of YOLOv4-tiny, but the model we proposed is 13% faster than YOLOv4-tiny, and this gain in speed is much greater than the loss in accuracy.
(3) We tested the models on a GTX1080Ti with the 1080P inspection video, and the results show that both reach more than 210 AvgFPS, which is beyond the real-time detection standard.

Figures 10 and 11 show the results for six pictures of the power line, from which the following conclusions can be drawn:
(1) As shown in Figures 10(a), 10(b), 11(a), and 11(b), both methods recognize the objects in the image well, even when an object is located at the boundary of the image, and the boxes predicted by our method generally have higher confidence than those predicted by YOLOv4-tiny. However, in Figures 10(a) and 11(a), YOLOv4-tiny predicts one more correct box than the method we proposed. In Figures 10(a) and 11(a), the confidence levels of the boxes predicted by our method are 83.29, 48.53, 53.46, and 64.65, and those of the boxes predicted by YOLOv4-tiny are 36.69, 56.98, 37.19, and 30.56, respectively; in Figures 10(b) and 11(b), the confidence levels of the boxes predicted by our method are 98.72, 61.49, and 45.51, and those of the boxes predicted by YOLOv4-tiny are 97.80, 73.83, and 37.20, respectively.
(2) As shown in Figures 10(c), 10(d), 11(c), and 11(d), the model we proposed positions the partially occluded jyz_dx object, whereas YOLOv4-tiny does not detect the occluded object, indicating that the model we proposed is better at recognizing occluded objects.
(3) As shown in Figures 10(e), 10(f), 11(e), and 11(f), when the shooting conditions are poor (backlighting and underexposure), both methods can position the object within certain limits.
These results indicate that the performance of our method is generally similar to that of YOLOv4-tiny, and that the model we proposed is better at recognizing occluded targets.

When the model input size is 608 × 608, the weight file sizes of the YOLOv4-tiny model and the model we proposed are 22.4 MB and 25.1 MB, respectively. In actual deployment (Table 5), the memory occupied by YOLOv4-tiny on a single GPU during training and testing is 1163 MB and 501 MB, respectively, and the memory occupied by the model we proposed during training and testing is 1150 MB and 485 MB, respectively. Based on the above analysis, the weight file of the model we proposed is about 12% larger than that of YOLOv4-tiny, but the model occupies about 2% less GPU memory. At the same time, the detection speed of the model we proposed is about 13% faster, which makes it suitable for deployment on edge computing devices with weak GPU performance.

Jetson TX2 and Jetson Xavier NX are both lightweight AI development boards with CUDA computing units provided by NVIDIA. They are edge computing modules for embedded devices that can be applied to real-time detection in actual power inspection. We used the same 1080P video collected on an actual power line to test the AvgFPS of YOLOv4-tiny and of the model we proposed on the Jetson TX2 and Jetson Xavier NX devices (Table 6). The AvgFPS of YOLOv4-tiny and of the model we proposed are 33.8 and 38.9 on Jetson TX2 and 45.2 and 52.1 on Jetson Xavier NX, so both models achieve real-time detection on both devices. The model we proposed is about 15% faster than YOLOv4-tiny on these devices, a slightly larger gain than on the 1080Ti.
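The percentages quoted in this section follow directly from the reported figures, as the short check below shows. The helper name is an assumption; all numbers are the ones given in the text.

```python
def relative_change(new, old):
    """Relative change of new with respect to old, in percent."""
    return (new - old) / old * 100.0


speedup_tx2 = relative_change(38.9, 33.8)    # AvgFPS on Jetson TX2
speedup_nx = relative_change(52.1, 45.2)     # AvgFPS on Jetson Xavier NX
weight_growth = relative_change(25.1, 22.4)  # weight file size in MB
memory_train = relative_change(1150, 1163)   # GPU memory when training, MB
memory_test = relative_change(485, 501)      # GPU memory when testing, MB
```

Rounded to one decimal place, the results are 15.1% and 15.3% for the two speedups, 12.1% for the weight file, and -1.1% and -3.2% for the GPU memory during training and testing, which average to roughly the 2% reduction stated above.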

5. Conclusion

In this paper, an improved YOLOv4-tiny network for power inspection that combines the ResNet-D block and the adjusted Res-CBAM block is proposed to address inefficient target identification in power inspection. The results show that accuracy is maintained while the speed is increased by 13%. The model we proposed achieves an mAP of 0.6401 and 248 FPS on the 1080Ti. Objects can be identified with a high degree of confidence under occlusion and poor lighting conditions, and real-time detection (above 35 FPS) can be achieved with lower GPU memory occupancy even on weaker GPU devices (Jetson TX2 and Jetson Xavier NX).

Data Availability

The data presented in this study are available on request from the corresponding author.


The funders had no role in the design of the study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


This research was funded by the Science and Technology Project of SGCC (Research and application of audio-visual active perception and collaborative cognitive technology for smart grid operation and maintenance scenarios) (5600-202046347A-0-0-00). The authors would like to thank Electric Power Research Institute of State Grid Henan Electric Power Company and Beijing Imperial Image Intelligent Technology Co. for support in the work in this paper.