Surface Defect Detection with Modified Real-Time Detector YOLOv3
In this paper, a modified YOLOv3 network is proposed for surface defect detection. Unlike pixel-level segmentation methods, YOLOv3 locates the regions of surface defects with bounding rectangles. Compared with conventional detectors, YOLOv3 runs efficiently because it does not generate region proposals with sliding boxes. Although pixel-level details of defects are omitted in the process, the primary information, namely the locations and class labels of defects, is extracted by YOLOv3 with high accuracy. This information is sufficient for surface defect inspection, and computational efficiency is improved at the same time. To further lighten the structure of YOLOv3, loss function optimization and a pruning strategy are applied to the original YOLOv3. The pruning ratio is determined by the tradeoff between detection accuracy and computational efficiency. In our experiments, we compare the modified YOLOv3 with several state-of-the-art methods, and the modified YOLOv3 achieves the best performance on six types of surface defects in the DAGM 2007 dataset.
Nowadays, artificial intelligence [1–3] plays an important role in many kinds of social applications. In the field of industrial automation, effective surface defect detection is crucial to quality control in the industrial environment. The traditional manual detection method is time-consuming, and it is not suitable for large-scale products. Another primary source of difficulty in manual detection is that many factors of variation influence the detection accuracy. Deep learning, which has gained tremendous traction in recent years, and convolutional neural networks (CNNs) in particular, has achieved great success in many computer vision tasks such as image detection and classification. Following recent advances in deep learning, CNNs are now commonly employed for industrial inspection tasks.
Deep learning algorithms use appropriate classifiers to detect and classify defects. Deep convolutional neural networks are one of the breakthrough methods for image classification on high-complexity datasets. Several deep learning methods have been applied to surface defect inspection. The fully convolutional network (FCN) approach combines a segmentation stage and a detection stage, which are operated with two separate fully convolutional networks [7, 8]. ViDi is the first deep learning-based image analysis software designed specifically for factory automation. Qiu et al. proposed a pixel-wise surface defect segmentation method, which contains a segmentation stage, a detection stage, and a matting stage. The deep recurrent neural network (DRNN) consists of four modules: a deep regression-based detection model, pixel-level false positive reduction, connected component analysis, and a deep network for defect type classification. It can also be simplified into a detector without the classification procedure. These methods have been tested on surface defect inspection datasets with a certain degree of success.
A central family of automatic object detection algorithms is the region-based approach. Regionlets are subregions defined within the feature extraction region to capture the possible locations of the target object. The method first constructs a largely over-complete regionlet pool, and then each object bounding box is classified by a cascaded boosting classifier. On the other hand, the CNN-based method R-CNN and its modified versions Fast R-CNN and Faster R-CNN use regions to hypothesize object locations within the image, where a bounding box around the object of interest is created to lock its position. Note that Faster R-CNN speeds up detection by using neural networks to propose bounding boxes instead of the selective search used in Fast R-CNN. Generally, CNN-based methods work reasonably well on various kinds of targets. However, their operational efficiency and detection accuracy still need to be further enhanced.
Unlike the anchor based algorithms, YOLO (You Only Look Once) is a real-time object detection system proposed by Redmon et al. It frames object detection as a regression problem, and a single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes, which makes YOLO extremely fast. To improve the localization accuracy and recall rate, Redmon and Farhadi came up with a better, faster, and stronger YOLO that can detect over 9000 different object categories, named YOLO 9000. In this version, the fully connected layers from YOLO are removed, and anchor boxes are used to predict bounding boxes. The most advanced version of YOLO is known as YOLOv3, where Darknet-53 is used to detect objects via layers at different scales. Another appealing feature of YOLOv3 is that the end user can simply change the model size to trade off detection accuracy against speed without retraining the model.
In detection tasks, objects with similar sizes and appearances are easily detected. For categories with large scale variance and changeable appearance, detection accuracy is limited. Therefore, specific processing mechanisms need to be designed to handle these variations. In this paper, a new item of the loss function based on the generalized intersection over union (GIoU loss) is designed to deal with categories with large scale variations in the training phase. Compared with the original GIoU loss, the modified item pays more attention to the area difference between the predicted objects and the ground truth. Furthermore, a pruning strategy is adopted to enhance operating efficiency. The running speed increases considerably when a certain percentage of the network parameters is pruned.
The rest of the paper is organized as follows. We first give a brief review of the real-time object detection system YOLO and introduce the loss function optimization and pruning strategy. Then, the modified YOLOv3 net is applied to DAGM 2007, a popular benchmark dataset for multiclass object detection, wherein the full and pruned versions of YOLOv3 are compared with previous methods, including FCN [7, 8], ViDi, Qiu, DRNN and its simplified version, and several variants of the original YOLOv3. We wrap up with a general discussion and potential future work on defect detection.
2. Modification of YOLOv3
The YOLO series of networks is popular in the field of object detection, and the latest versions are powerful enough to handle thousands of categories. However, the latest versions are designed to extract fine differences among many categories, and their structures are more complex than that of the typical YOLOv3. Such complex structures are prone to overfitting without large amounts of training data. For example, the latest YOLOX network still uses the backbone network of YOLOv3. In surface defect detection, only a small number of defect categories and samples are analyzed in each real application. Considering these factors, YOLOv3 is chosen as the network to be modified in this paper, balancing effectiveness and efficiency with a limited number of training samples.
Different from other detection methods, the anchor free strategy of YOLO finds targets without generating region proposals by sliding boxes. Since multiple independent logistic classifiers and a binary cross-entropy loss are adopted in YOLOv3, the classification mechanism is stronger than in previous versions of YOLO. In prediction, boxes are predicted at various scales with different feature extraction structures. The information contained in each bounding box, such as the potential location of an object and the corresponding class probabilities, is extracted in a single pass.
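The independent logistic classification with binary cross-entropy mentioned above can be illustrated with a minimal sketch. The class scores and the label vector below are hypothetical values chosen for illustration, and the function names are not taken from the paper or from any YOLOv3 implementation.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent per-class logistic outputs."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total += -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
    return total / len(logits)


# Hypothetical raw class scores for one predicted box and its label vector.
logits = [2.0, -1.5, 0.3]
targets = [1, 0, 0]
probs = [sigmoid(z) for z in logits]
loss = multilabel_bce(logits, targets)
print(probs, loss)
```

Because each class has its own logistic output, the probabilities are not forced to sum to one, which is the practical difference from a softmax classifier.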
The backbone of the YOLOv3 adopted in this paper is Darknet-53, in which multiple residual blocks are combined to extract features. The total number of convolutional layers in the backbone is 53. Each convolutional/residual block contains convolutional filters with sizes of $1\times 1$ and $3\times 3$, batch normalization, and leaky ReLU. These residual blocks guarantee that the network converges despite its deep structure. The pipeline of YOLO is as follows. A given input image is first rescaled to a fixed input size. Objects are then detected on different layers assigned to three scales: the coarsest layer is used to detect large objects, the intermediate layer detects medium objects, and the finest layer is responsible for smaller objects, which significantly improves the detection of small objects compared with YOLO 9000.
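The three detection scales can be made concrete with a short sketch. It assumes the downsampling strides 32, 16, and 8 of the standard YOLOv3 detection heads and an input size of 416 pixels; the input size is an assumption for illustration only and is not taken from this paper.

```python
def detection_grids(input_size, strides=(32, 16, 8)):
    """Return the grid side length of each detection scale for a square input."""
    if any(input_size % s for s in strides):
        raise ValueError("input_size must be divisible by every stride")
    return {s: input_size // s for s in strides}


# 416 is an assumed input size, not a value stated in the paper.
grids = detection_grids(416)
for stride, side in grids.items():
    print(f"stride {stride}: {side}x{side} grid")
```

The coarsest grid (largest stride) has the largest receptive field per cell and therefore handles large objects, while the finest grid handles small ones.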
In the modified structure of YOLOv3 in this paper, loss function optimization and a network pruning strategy are adopted to meet the requirements of surface defect inspection, as described in the following subsections.
2.1. Loss Function
YOLOv3 further improves on the previous versions YOLOv1 and YOLO 9000 by replacing the softmax loss with a logistic loss; i.e., object confidence and class predictions in YOLOv3 are obtained through logistic regression. Although pixel-wise information of the defect regions is not segmented in detail, the bounding rectangles are detected and classified with high accuracy. The information extracted by YOLOv3 is quite useful as an aid to defect inspection on the assembly line. However, the estimated rectangles do not adapt well when the scales of different targets vary widely.
In this paper, a modified loss function is proposed to enhance the performance of YOLOv3. The newly added item of the loss function is built on the generalized intersection over union (GIoU) loss. On top of the GIoU loss, the item takes into account the area proportion between the ground truth and the corresponding estimated rectangle.
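The modified GIoU loss, denoted here by $L_{\mathrm{MGIoU}}$, is defined in Equation (1) from the area ratio between the target and the estimated rectangle and the GIoU, with
$$\mathrm{GIoU} = \mathrm{IoU} - \frac{|C \setminus (A \cup B)|}{|C|},$$
where $A$ is the area of the target, $B$ is the area estimated by YOLOv3, and $C$ is the smallest convex object enclosing them, and
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}.$$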
Since the intersection over union focuses on the overlap ratio, it causes a compromise phenomenon: the estimated rectangles tend to be smaller than large targets and larger than small targets. Although GIoU is not sensitive to object size, this phenomenon still exists. In surface defect inspection, most types of surface defects are small while a few types are much larger than the others, and the estimated rectangles are smaller than the target areas in most cases. The added area proportion is used to alleviate this issue and to lead the estimated rectangle to match the size of the target more closely.
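To make the quantities in this subsection concrete, the following sketch computes the IoU, the GIoU, and an area ratio for two axis-aligned rectangles given as (x1, y1, x2, y2). The GIoU follows the standard definition from the cited GIoU work; the `area_ratio` helper (estimated area divided by target area) is only an illustrative assumption and is not the paper's exact modified loss term. Both boxes are assumed to have positive area.

```python
def box_area(b):
    """Area of an axis-aligned box (x1, y1, x2, y2); empty boxes have area 0."""
    x1, y1, x2, y2 = b
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_giou(a, b):
    """Return (IoU, GIoU) for two axis-aligned boxes with positive area."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    # Smallest enclosing box of a and b (the convex object C for rectangles).
    hull = box_area((min(a[0], b[0]), min(a[1], b[1]),
                     max(a[2], b[2]), max(a[3], b[3])))
    iou = inter / union
    giou = iou - (hull - union) / hull
    return iou, giou


def area_ratio(target, estimate):
    """Estimated area over target area (an illustrative definition only)."""
    return box_area(estimate) / box_area(target)


# Hypothetical large target with an undersized estimate nested inside it.
target = (0, 0, 10, 10)
estimate = (2, 2, 8, 8)
print(iou_giou(target, estimate), area_ratio(target, estimate))
```

For the nested example, the IoU and GIoU coincide because the enclosing box equals the target, while the area ratio directly exposes how undersized the estimate is, which is the kind of information the added area proportion is meant to capture.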
With the modified GIoU loss, the total loss function becomes
$$L_{\mathrm{total}} = L_{\mathrm{YOLOv3}} + L_{\mathrm{MGIoU}},$$
where $L_{\mathrm{YOLOv3}}$ is the original loss function used in the standard YOLOv3, which focuses on the accuracy of bounding box prediction, class prediction, and predictions across scales. The added item $L_{\mathrm{MGIoU}}$, the modified GIoU loss, further strengthens the evaluation criterion on area ratio and intersection over union.
2.2. YOLO Pruning
To guarantee the ability to process various types of targets, deep learning networks usually contain complex structures and a large number of parameters. Therefore, substantial computing resources must be secured to achieve computational efficiency, and deployment on terminal devices is limited by high hardware costs. Effective optimization schemes are thus required to lighten the structures of deep learning networks and to change the model size, trading off accuracy and speed without the overhead of training new models.
There are several strategies to optimize deep learning networks, such as weight quantization, branch pruning [23, 24], and model compression. With little loss of accuracy, the number of parameters can be decreased to a large degree and computational efficiency is enhanced.
In this paper, a branch pruning strategy is adopted to optimize YOLOv3. The original YOLOv3 is pruned following a previously proposed pruning criterion. The training objective is set to be
$$L = \sum_{(x, y)} l\big(f(x, W), y\big) + \lambda \sum_{\gamma \in \Gamma} g(\gamma),$$
where $(x, y)$ denotes the input and target of a training sample, $W$ is the trainable weights of YOLOv3, $l(\cdot)$ is the training loss on the network output $f(x, W)$, $g(\cdot)$ is a sparsity-inducing $L_1$-norm penalty on the scaling factors $\gamma$ of batch normalization, and $\lambda$ is a trade-off factor between the two terms. For each channel, a scaling factor is set and multiplied with the channel's output. Then, the network weights and scaling factors are trained jointly with sparsity regularization. The channels in convolutional layers with small scaling factor values are pruned. Finally, the compact model is fine-tuned to achieve accuracy comparable to that of the original full network.
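The channel selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the scaling factors are hypothetical values, the pruning ratio is applied to a single flat list of channels, and the function names are not from the paper; whether the original work thresholds per layer or globally is not stated in the text.

```python
def l1_penalty(gammas, lam):
    """Sparsity term: lam times the sum of |gamma| over the scaling factors."""
    return lam * sum(abs(g) for g in gammas)


def channel_mask(gammas, prune_ratio):
    """Keep-mask that drops the prune_ratio fraction with the smallest |gamma|."""
    n_prune = int(len(gammas) * prune_ratio)
    order = sorted(range(len(gammas)), key=lambda i: abs(gammas[i]))
    pruned = set(order[:n_prune])
    return [i not in pruned for i in range(len(gammas))]


# Hypothetical batch-normalization scaling factors after sparse training.
gammas = [0.90, 0.02, 0.45, 0.01, 0.70, 0.03, 0.60, 0.80]
mask = channel_mask(gammas, prune_ratio=0.25)
print(mask)
print(l1_penalty(gammas, lam=1e-4))
```

Channels whose mask entry is False would be removed together with their incoming and outgoing weights, after which the compact network is fine-tuned.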
Without any special libraries or hardware for efficient inference, the number of parameters, the memory requirements, and the computing operations of the original YOLOv3 are all reduced.
In the field of surface defect inspection, surface defects are usually difficult to find in open environments. Therefore, surface images are captured with a specific light source in closed environments free of external influences, which is the mainstream practice in real applications. In the experiments, two typical open-source datasets are used to verify the performance of the modified algorithm: DAGM 2007 (http://resources.mpi-inf.mpg.de/conferences/dagm/2007/prizes.html) and GC10-DET (https://github.com/lvxiaoming2019/GC10-DET-Metallic-Surface-Defect-Datasets). Both were captured with special operations in closed environments. To evaluate defect inspection performance, the proposed methods, noted as M-YOLOv3 (full/pruned), are compared with other state-of-the-art detectors: YOLOv3 and its pruned version, FCN, ViDi, Qiu, and DRNN and its simplified version with no classifier.
3.1. Experimental Study
DAGM 2007 is a typical dataset in the field of visual surface defect detection. It contains ten types of surface defects, of which the first six types are used for public testing. Each image represents one surface sample, which may or may not contain a defect. In our experiments, we test the first six types of DAGM 2007. To guarantee a sufficient quantity of experimental samples, all training samples with surface defects are used in the training phase. The testing samples with surface defects are split into two subsets: the first is used for validation in the training phase, and the other is used to test the performance of the trained algorithm. The numbers of samples in the training and testing phases are 618 and 297, respectively. All training and testing samples with defects take part in our experiments, so the results are more convincing than those obtained with fewer samples. For each image, the detected rectangles with the same label are merged to form the region of the surface defect.
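The merging of same-label rectangles can be sketched as below. The text does not specify how the merge is performed, so this sketch assumes the merged region is the smallest rectangle enclosing all boxes with the same label; the labels and coordinates are hypothetical.

```python
def merge_by_label(detections):
    """Merge boxes that share a label into one enclosing box per label.

    detections: iterable of (label, (x1, y1, x2, y2)) pairs for one image.
    """
    merged = {}
    for label, (x1, y1, x2, y2) in detections:
        if label not in merged:
            merged[label] = (x1, y1, x2, y2)
        else:
            mx1, my1, mx2, my2 = merged[label]
            merged[label] = (min(mx1, x1), min(my1, y1),
                             max(mx2, x2), max(my2, y2))
    return merged


# Hypothetical detections on a single image.
detections = [
    ("scratch", (10, 10, 30, 20)),
    ("scratch", (25, 15, 50, 40)),
    ("spot", (100, 100, 120, 120)),
]
regions = merge_by_label(detections)
print(regions)
```

An enclosing rectangle is the simplest choice when each image holds at most one defect, as in DAGM 2007; a union of rectangles would be an alternative when several separate defects of the same type can appear.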
GC10-DET, released in 2019, is more challenging and covers more complex situations. It was collected in a real industrial setting and contains ten types of surface defects. The number and types of defects in each image are not limited, and the defect scales vary more widely than in DAGM 2007. In our experiments, all ten types are tested. The data are processed in a way similar to that used for DAGM 2007. The numbers of samples of GC10-DET in the training and testing phases are 1583 and 689, respectively.
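In our experiments, detection accuracy is evaluated by several metrics:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives, respectively.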
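The metrics built from true positives, false positives, and false negatives can be computed as in the following sketch, which uses the standard definitions of precision, recall, and F1. The counts in the example are hypothetical and are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, f1)
```

In detection, a prediction usually counts as a true positive only when its IOU with a ground-truth rectangle exceeds a chosen threshold, so all three values depend on that threshold.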
The IOU (intersection over union) value is the ratio between the overlap and the union of a detected result and the marked ground-truth rectangle. Based on the IOU, the AR column is defined as
where is the number of defects and is the -th ground truths.
Further, mAP (mean average precision) is the mean value of the AP (average precision) over classes, where AP equals the area under the precision-recall curve.
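The area under the precision-recall curve can be approximated in several ways, and the text does not say which one is used. The sketch below assumes the common all-point interpolation, in which precision is first replaced by its monotone envelope; the recall and precision values are hypothetical and must be given in order of increasing recall.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve using a monotone envelope."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """mAP as the plain mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical operating points for one class.
ap = average_precision([0.2, 0.4, 0.6, 0.8], [1.0, 0.8, 0.6, 0.5])
print(ap, mean_average_precision([ap, 1.0]))
```

Other conventions, such as 11-point interpolation, give slightly different AP values, so reported mAP figures are only comparable when the same convention is used.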
Unlike the previous segmentation strategies, we use detected rectangles to represent the regions of surface defects. This representation sharply improves detection efficiency. The experiments were run on a desktop with a 2.40 GHz CPU, 19.19 GB of RAM, and a GeForce RTX 2080 Ti GPU with 11.02 GB of memory. The project is mainly developed in PyTorch with Python 3.7. In practice, the running speeds of the proposed M-YOLOv3 (full) on DAGM and GC10-DET are 29.84 fps and 28.30 fps, respectively. The running speeds of the proposed M-YOLOv3 (pruned, 15%) on DAGM and GC10-DET are 35.65 fps and 35.64 fps, respectively. Since the running speed is high and individual timing measurements are not stable, the reported values are averaged over ten runs on all testing samples. The total parameter size of M-YOLOv3 (full) is 235.15 MB, and 15% of the parameters are pruned in M-YOLOv3 (pruned, 15%). The detection efficiency is better than that of most algorithms in Table 1, with fewer computing resources.
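As a quick check of what the reported frame rates imply, the sketch below derives the relative speedup of the pruned model from the fps values quoted above. Only the four fps figures come from the text; the helper name and the dictionary layout are illustrative.

```python
# Frame rates reported in the text for M-YOLOv3 (full) and (pruned, 15%).
fps = {
    "DAGM": {"full": 29.84, "pruned": 35.65},
    "GC10-DET": {"full": 28.30, "pruned": 35.64},
}


def speedup_percent(full, pruned):
    """Relative gain in frames per second of the pruned model, in percent."""
    return (pruned / full - 1.0) * 100.0


gains = {name: speedup_percent(v["full"], v["pruned"])
         for name, v in fps.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}% faster after pruning")
```

Pruning 15% of the parameters therefore corresponds to roughly 19% more frames per second on DAGM and roughly 26% more on GC10-DET.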
3.2. Result Comparison
The evaluation results of defect detection accuracy on DAGM 2007 are shown in Table 1. In our simulations, the full version of the modified YOLOv3 is noted as M-YOLOv3 (full), and the pruned modified version with pruning ratio $p$% is noted as M-YOLOv3 (pruned, $p$%). The full version of the original YOLOv3 is noted as YOLOv3 (full), and the pruned original version with pruning ratio $p$% is noted as YOLOv3 (pruned, $p$%). For comparison with M-YOLOv3, a further variant, noted as YOLOv3-giou, is constructed by simplifying the added loss item of M-YOLOv3 to the GIoU function, with a selected optimized weight parameter of 0.75 in place of the area ratio in Equation (1).
3.2.1. Performance Comparisons on DAGM
In the comparison, M-YOLOv3 (full) and YOLOv3 (full) obtain the best and second-best overall performance on these evaluation criteria. The precision, recall, and F1 values are all above 97% for M-YOLOv3 (full) and above 94% for YOLOv3 (full). M-YOLOv3 (pruned, 15%) requires less computation than YOLOv3 (pruned, 3%) at similar performance and is more accurate than YOLOv3 (pruned, 15%) at similar computation. YOLOv3-giou (pruned, 15%) performs better than YOLOv3 (pruned, 15%) but worse than M-YOLOv3 (pruned, 15%). The detection accuracy of FCN, ViDi, and Qiu is rather poor in terms of AR and MeanIOU. DRNN (full) and DRNN (no classifier) perform better than these three detectors but worse than M-YOLOv3 (full) and M-YOLOv3 (pruned).
M-YOLOv3 (pruned) performs even better than DRNN (full). Although pixel-level segmentation is not adopted in M-YOLOv3, the detected rectangles are still reliable for locating the defects. Therefore, the M-YOLOv3 (pruned, 15%) net is more practical in terms of both detection precision and computational efficiency.
More details of M-YOLOv3 (full/pruned) and YOLOv3 (full/pruned) on each class of defects in DAGM 2007 are given in Tables 2, 3, and 4. The M-YOLOv3 (full) net achieves excellent accuracy on all classes, and the YOLOv3 (full) net works well on Classes 1, 2, 4, and 5. Since the defects in Class 3 are rather small, the boundaries of the regions detected by the YOLOv3 (full) net are larger than the ground truths. In contrast, the defects in Class 6 are several times larger than the other types of defects. These classes with irregular scales are well balanced by the M-YOLOv3 (full) net. For M-YOLOv3 (pruned, 15%) (Table 3), most values remain above 95%, with only small drops. Compared with the M-YOLOv3 (full) net, the precision on Classes 4, 5, and 6 even increases slightly. Only the recall, F1, and mAP become worse by certain ratios, and they are still better than those of YOLOv3-giou (pruned, 15%) at the same pruning ratio. The YOLOv3-giou (pruned, 15%) net is competitive on Classes 1, 2, and 3 but performs poorly on the other three classes. For YOLOv3 (pruned, 3%) in Table 4, the precision on Class 6 drops by 9.48%. For YOLOv3 (pruned, 15%), the values of all four items drop by large ratios.
As shown in Figure 1, the differences between the performance of M-YOLOv3 (full) and M-YOLOv3 (pruned, 15%) are not significant. Most values of precision, recall, and mAP are similar. The majority of the detected rectangles coincide closely with the ground truths. Compared with YOLOv3 (pruned, 3%), M-YOLOv3 (pruned, 15%) performs similarly on all classes, with comparable results and slight mutual advantages across the criteria. YOLOv3 (pruned, 15%) performs considerably worse than the other methods. The performance of M-YOLOv3 (pruned, 15%) degrades only on a few examples and remains competitive on the majority of the testing examples. M-YOLOv3 performs better than YOLOv3 at the same pruning ratio of 15%. The pruned version can be pruned further according to an assigned pruning ratio, at a certain loss of detection accuracy. In practical applications, the choice between the full and pruned versions of the modified YOLOv3 should be balanced according to the requirements of efficiency and accuracy.
Figure 1: (a) Precision versus IOU threshold. (b) Recall versus IOU threshold. (c) mAP versus IOU threshold.
Examples of several typical detected defects are shown in Figure 2. The trained YOLOv3 can deal with various kinds of surface defects, and the scales and locations of the detected results are appropriate for the demonstrated samples.
3.2.2. Performance Comparisons on GC10-DET
In order to further verify the performance of the proposed M-YOLOv3 (full/pruned), its results on the GC10-DET dataset are compared with those of YOLOv3. With operations similar to those on the DAGM dataset, the proposed M-YOLOv3 also achieves better overall performance than the original YOLOv3 network on the GC10-DET dataset. The performance comparison on each defect type in GC10-DET is given in Table 5, with an IOU threshold of 0.5.
As shown in Table 5, the proposed M-YOLOv3 (full) has the best overall performance on all items. The proposed M-YOLOv3 (pruned) performs comparably to YOLOv3 (full) and is even better in Recall, F1, and mAP. YOLOv3 (full) is only slightly better on several items for individual types and is poorer than M-YOLOv3 (full/pruned) on the remaining results.
Since the GC10-DET dataset is more challenging than the DAGM dataset, the overall values of these items are lower than those on the DAGM dataset. The GC10-DET dataset was collected in real applications, so these results are more persuasive for performance comparison. Considering both effectiveness and efficiency, the proposed M-YOLOv3 (pruned) is more practical in most real applications. For time-insensitive cases, the proposed M-YOLOv3 (full) is the best choice among them. Examples of several typical detected defects in the GC10-DET dataset with M-YOLOv3 (pruned, 15%) are shown in Figure 3.
In this paper, we adopt the (pruned) M-YOLOv3 for surface defect inspection. Compared with traditional segmentation-based methods, the operating efficiency is improved significantly. Thanks to its strong processing ability, the trained M-YOLOv3 detects various types of surface defects with high accuracy. To further improve the detector, loss function optimization is adopted, taking into account the generalized intersection over union and the area ratio. The pruning strategy removes the branches of YOLOv3 that have less influence on its final outputs. The number of parameters is reduced after pruning, with little impact on detection accuracy.
Detected rectangles are useful for locating the regions of surface defects. However, this representation is still not sufficiently detailed, as too many normal pixels are included in the detected rectangles. A polygonal representation could be considered to refine the detected regions. Adversarial learning could be added in the training phase to deal with insufficient training examples and to enhance the model's robustness and generalization ability. Besides, more effective pruning strategies could be designed to further decrease the number of parameters with less influence on detection accuracy.
The DAGM 2007 dataset can be downloaded from the website https://conferences.mpi-inf.mpg.de/dagm/2007/prizes.html.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work is supported by Elite Plan of Shandong University of Science and Technology (No. 0104060540508), the Natural Science Foundation of Shandong Province under Grant (No. ZR2020MF132), and Research Funding of Post-Doctor who came to Shenzhen (No. E00120210001).
B. Jan, H. Farman, M. Khan et al., “Deep learning in big data analytics: a comparative study,” Computers & Electrical Engineering, vol. 75, pp. 275–287, 2019.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, 2012.
J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, Boston, MA, USA, 2015.
Cognex ViDi of Cognex, September 2017, https://www.cognex.cn/zh-cn/products/deep-learning/vidi-tools.
X. Wang, M. Yang, S. Zhu, and Y. Lin, “Regionlets for generic object detection,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 17–24, Sydney, Australia, 2013.
R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142–158, 2016.
R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, 2015.
S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” Advances in Neural Information Processing Systems, vol. 28, pp. 91–99, 2015.
J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, “ImageNet: a large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, Miami, FL, USA, 2009.
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, 2016.
J. Redmon and A. Farhadi, “YOLO 9000: better, faster, stronger,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271, Honolulu, HI, USA, 2017.
Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “YOLOX: exceeding YOLO series in 2021,” 2021, http://arxiv.org/abs/2107.08430.
H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: a metric and a loss for bounding box regression,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658–666, Long Beach, CA, USA, 2019.
Z. Cai, X. He, J. Sun, and N. Vasconcelos, “Deep learning with low precision by half-wave Gaussian quantization,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 5406–5414, Honolulu, HI, USA, 2017.
H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnet,” 2016, http://arxiv.org/abs/1608.08710.
Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017.