Abstract

In recent years, salient object detection has been widely applied to industrial visual inspection tasks. Automated surface inspection (ASI) can be regarded as one of the most challenging tasks in computer vision because of its high cost of data acquisition, serious imbalance of test samples, and high real-time requirement. Inspired by the requirements of industrial ASI and the methods of salient object detection (SOD), a task mode of defect type classification plus defect area segmentation and a novel deeper and mixed supervision network (DMS) architecture are proposed. The backbone network, ResNeXt-101, was pretrained on ImageNet. First, we extract five multiscale feature maps from the backbone and concatenate them layer by layer. In addition, to obtain the classification prediction and saliency maps in one stage, image-level and pixel-level ground truth are trained in the same side output network. A supervision signal is imposed on each side layer to realize deeper and mixed training of the network. Furthermore, the DMS network is equipped with a residual refinement mechanism to refine the saliency maps of input images. We evaluate the DMS network on four open access ASI datasets and compare it with 20 other methods, which indicates that mixed supervision can significantly improve the accuracy of saliency segmentation. Experimental results show that the proposed method achieves state-of-the-art performance.

1. Introduction

Surface defect detection based on computer vision is an important task in industry. Traditionally, object surface defect detection has been performed by the human eye. However, such artificial recognition-based detection methods are highly subjective, time-consuming, and lack accuracy. To overcome the limitations of manual inspection, automated surface inspection (ASI) technology has arisen to replace human decision.

In industry, the automated surface inspection task is to detect local anomalies in textured surfaces. These textures can be divided into uniform textures and uneven textures. Surface inspection objects include steel [1], wood [2], stone [3], ceramic tile [4], and fabric [5].

To achieve automated surface inspection, many image processing-based methods have been proposed. Traditional ASI methods can be divided into four main categories: structural methods, statistical methods, filter-based methods, and model-based methods [6]. Structural methods simulate primitives and displacements and are often used in repetitive patterns, including roughness measurements, boundary features, and morphology [7]. Statistical methods, which measure the distribution of pixel values, are commonly used in the detection of random textures (wood, castings, and tiles), including the histogram method [8], local binary patterns (LBP) [9], and the gray-level co-occurrence matrix (GLCM) [10]. Filter-based methods, which can be divided into spatial domain methods [11] and frequency domain methods [12], directly apply a filter bank to the texture patterns. Model-based approaches build a complete representation of the defect by modeling multiple features of the defect [4]. In general, despite the wide variety of automated surface inspection methods, the purpose of these traditional methods is to construct templates or features of the image. Model performance depends on the accuracy of the defect modeling, which means the generalization ability of the model is greatly limited.
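As a brief illustration of the statistical family, the sketch below computes GLCM texture statistics for a grayscale patch with scikit-image; the pixel-pair offsets, angles, and property set are our own illustrative choices rather than parameters taken from the cited works.

```python
# Hypothetical sketch: GLCM texture features for an 8-bit grayscale patch.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(patch: np.ndarray) -> np.ndarray:
    """Compute simple GLCM statistics for a uint8 grayscale patch."""
    glcm = graycomatrix(
        patch,
        distances=[1, 2],                                # pixel-pair offsets
        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],  # four directions
        levels=256,
        symmetric=True,
        normed=True,
    )
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.concatenate([graycoprops(glcm, p).ravel() for p in props])

# A defective patch typically shows higher contrast and lower homogeneity
# than a defect-free patch of the same uniform texture.
```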

In recent years, convolutional neural networks have been widely used in computer vision tasks such as image classification, object detection, semantic segmentation, and salient object detection. Neural networks have also become the mainstream of automated surface inspection, with powerful image feature learning and generalization ability that avoids the traditional ASI methods' over-reliance on hand-crafted features. In addition, generic surface defect detection models have become possible. Park et al. [13] designed a simple CNN to classify the surface defects of six different materials, far outperforming traditional feature extraction with a classifier. Weimer et al. [14] used CNN models to semantically segment the defect regions of the repeated texture patterns in the DAGM2007 dataset, significantly improving segmentation performance.

With the introduction of neural network models and the continuous improvement in algorithm performance, the ASI task model is also evolving, from image-level defect classification to finer-grained pixel-level segmentation or object detection.

Ren et al. [15] redefined the ASI task mode as classification plus segmentation. Based on the Decaf network, they built a general automated surface detection method that performs image-level defect recognition and then segments the defect area at the pixel level with a pixel-by-pixel heat map algorithm. Since ASI datasets feature clear defect categories and a clear foreground and background in a single image, compared with general saliency segmentation in free scenes or the more complex semantic segmentation of the foreground, the task model of defect classification plus saliency segmentation is obviously a more reasonable choice. To solve the surface defect detection task for magnetic tile, Huang et al. [16] designed a surface defect detection network based on a neural network and saliency detection methods, realizing real-time detection of magnetic tile surface defects. This research also fully demonstrates that introducing image saliency detection can greatly help solve ASI tasks.

Although deep learning methods have achieved remarkable results in many computer vision tasks, their application to ASI tasks has been limited by various factors. First, deep learning methods require thousands of training images to ensure the training effect of the model and prevent overfitting; however, image collection in industrial scenes is difficult and expensive, and an ASI dataset often contains only a few hundred or even dozens of images. In addition, unlike general scene object detection and saliency detection tasks, most samples in practical industrial applications are negative samples, and it is expensive and inefficient to directly perform detection or segmentation on all samples. Therefore, it is of great significance to discuss how to efficiently and accurately identify and segment surface defect regions with a neural network. Finally, different from saliency detection in natural scenes, the foreground of ASI tasks usually consists of small-scale targets that are difficult to detect with traditional algorithms, such as holes and cracks. How to effectively segment these small-scale targets is also a huge algorithmic challenge for ASI tasks.

In this paper, the ASI task is defined as defect classification plus defect area segmentation. We propose the deeper and mixed supervision network (DMS), an innovative generic surface defect method that fulfills multiscale classification and salient defect detection in one stage. To achieve this, as Figure 1 illustrates, we first extract five different layers as side outputs from the backbone network and integrate them into three levels of feature maps. Second, we concatenate the different-level feature maps layer by layer in the side outputs. Third, we impose image-level and pixel-level ground truth on each feature layer to realize deeper and mixed supervision; during training, we design a loss function to balance the weights of classification and saliency segmentation. Finally, we refine the residuals through the multi-bypass outputs of the DMS network to obtain the classification prediction and pixel-by-pixel prediction of object defects. Test results on four open source ASI datasets show that the mixed supervision mechanism of the DMS network improves saliency segmentation results, and in both classification and segmentation tasks, the proposed models achieve the best current results on ASI tasks.

In summary, our contributions are fourfold:

(i) Based on the multilevel side output architecture of HED, we propose a novel deep network architecture, the DMS (including SDMS and BDMS) network, which combines recurrent high-medium-low feature concatenation and a residual refinement mechanism.

(ii) We propose a mixed supervision mechanism that fulfills defect classification and foreground segmentation in one stage; mixed supervision significantly improves the performance of SOD. We also propose a loss function to balance the weights of classification and foreground segmentation. In addition, mixed supervision provides a solution for processing nonsalient samples, one of the most challenging problems in generic SOD.

(iii) We further explore the industrial application of salient object detection, while most current applications focus on wild scenes.

(iv) We evaluate our network on four ASI datasets and compare it with other methods. Overall, the DMS network reaches state-of-the-art performance for SOD in ASI.

2. Related Work

2.1. Fully Supervised Salient Object Detection

Saliency detection divides image content into background and foreground, detecting the foreground according to salient features and segmenting it pixel by pixel. Many traditional methods fuse hand-crafted features for salient object detection [17, 18]. In recent years, neural network algorithms, especially fully convolutional networks (FCNs) [19], have dominated many fields of computer vision due to their convincing performance. For example, Zhang et al. [20] proposed a novel FCN-based structure to learn deep uncertain convolutional features, significantly improving the robustness and accuracy of saliency detection. However, these SOD models generally require a large number of pixel-level annotated images for training and can only produce a foreground inference, which cannot meet ASI task requirements. In contrast, our approach uses image-level ground truth for enhanced training and obtains defect classification and saliency segmentation results simultaneously.

2.2. Salient Object Detection with Image-Level Supervision

In early saliency segmentation tasks, training data were typically based entirely on expensive pixel-level annotations, while data with image-level ground truth were rarely used in saliency detection. This is because an image-level ground truth typically concerns the category of the object in the image rather than its specific location, while saliency detection is intended to detect the full extended area of the foreground object and ignore its category. However, research by Wang et al. [21] shows that image classification and saliency segmentation are essentially interrelated: the candidate regions provided by saliency detection help classification become more accurate, while the categories provided by image-level ground truth are likely to describe the foreground of the image. Using only image-level labels, their method approached or even surpassed the state-of-the-art fully supervised models of the time. Particularly in ASI tasks, the categories of defects are limited and normally have quite distinct features, which indicates that image-level ground truth may be even more useful when inferring foreground areas. Their WSS method fully demonstrates the contribution of image-level supervision to saliency detection, which inspired our DMS network.

2.3. Feature Concatenate and Dense Supervision Refinement

Feature concatenation and shortcut connections are among the hotspots of neural network research in recent years. He et al. [22], who proposed ResNet, were the first to propose the shortcut mechanism, challenging traditional neural networks in which connections exist only between adjacent layers. On this basis, DenseNet [23] applies denser connections and bypass settings under the assumption that feature concatenation is a better learning strategy than repeatedly learning redundant features. In object detection, FPN satisfies the requirements of the detection task by combining the location information of low-level feature maps with the classification information of high-level feature maps. These studies fully prove that the potential of the feature layers in traditional neural networks has not been fully explored. Recently, many saliency detection models have enhanced their detection results by combining low-level structural features and high-level semantic features through short connections, with obvious effects.

Deng et al. [24] designed a residual refinement block (RRB). They concatenate the input rough saliency map with a deep feature layer, and the output residual map is supervised and added to form a new saliency map, which is used as the input for the next round of recurrent refinement. R3Net achieves refinement of the saliency maps by repeatedly concatenating high- and low-layer features, which improves saliency detection. Zhang et al. [25] studied how to better aggregate multilevel convolutional feature maps for salient object detection, proposing a novel structure to combine the multilevel feature maps at each resolution and predict saliency maps from the combined features. These convincing studies indicate that the multilevel feature maps generated by FCNs are complementary.

Most recently, a large number of edge information enhancement methods have been proposed [26-30]. Zhao et al. [26] proposed using the complementarity of edge information and saliency information to enhance the boundary and location information of salient objects. Wu et al. [30] combined SOD with edge detection and developed a novel mutual learning module (MLM) to let the foreground contour and edge detection tasks guide each other simultaneously. It is obvious that reasonable additional information is a useful complement to the SOD task.

The DMS network proposed in this paper combines the mechanisms of multilevel feature concatenation, deep supervision, and residual refinement. The DMS backbone network is divided into low, medium, and high feature levels, and the network performs multiscale feature concatenation by means of short connections. The multi-bypass configuration satisfies the requirements of deep supervision and residual refinement.

3. Methodology

We show the structure of the single deeper and mixed supervision (SDMS) network in Figure 2, and Figure 3 shows the proposed structure of the bilateral deeper and mixed supervision (BDMS) network. We first select feature maps at five different scales from the input image as side outputs of the ResNeXt101 backbone network. The side outputs of different layers contain low-level details and high-level semantic information, respectively. The first-layer and second-layer feature maps are integrated as the low-level feature (LF), the third-layer feature maps are used as the middle-level feature (MF), and the fourth-layer and fifth-layer feature maps are integrated as the high-level feature (HF). For each feature layer (side output), we set an independent convolution filter to generate connectable feature maps and the corresponding saliency map; the parameters corresponding to each feature layer are shown in Table 1. We use the high-level features to generate the original saliency map and classification information, then concatenate the middle-level features with the original saliency map to generate detailed saliency maps and classification information, and finally concatenate the low-level features to generate saliency maps. We design a mixed loss function to adjust the loss weights of saliency segmentation and defect classification and to supervise the training of each level's saliency map and classification signal. In SDMS and BDMS, we adopt the saliency map after the third-level and sixth-level feature reuse and residual refinement, respectively, as the final output saliency map, and we obtain the image classification prediction. In the following subsections, we elaborate on the specific architectures of SDMS and BDMS, the weighted mixed loss, and the details of applying the proposed model to the surface defect detection task.

3.1. Deeper and Mixed Supervision

The ultimate goal of this paper is to achieve classification of defect categories and saliency segmentation of defect regions. However, unlike general SOD tasks, the ASI task contains many normal, i.e., nonsalient, samples, which most current SOD methods neglect. It is worth mentioning that recent research by Fan et al. [31] also shows that the selective neglect of nonsalient samples in most current SOD datasets and by researchers leads many SOD models to perform very differently in real-world scenes. To address this drawback while meeting the actual requirements of the ASI task, this paper proposes a mixed supervision model built on a multilevel side output architecture. It uses image-level weakly supervised labels to enhance saliency detection, can effectively classify and process nonsalient samples in actual ASI tasks, and significantly enhances the ability of the SOD model to handle nonsalient samples.

In Figure 3, we use ResNeXt101 as the backbone network to extract five sets of feature maps at different scales from the input image as the side outputs. The first-layer and second-layer feature maps are combined as the low-level feature (LF), the third-layer feature maps are used as the middle-level feature (MF), and the fourth-layer and fifth-layer feature maps are combined into the high-level feature (HF). We use the high-level features to generate the original saliency map and classification information, then concatenate the middle-level features with the original saliency map to generate detailed saliency maps and classification information, and finally concatenate the low-level features to generate saliency maps. All of the saliency maps are upsampled to the input size, and supervision signals are imposed on each level of saliency maps and classification results.

3.2. Improved Side Output Architecture

It is generally believed that, in neural networks, low-level features contain more detailed information while higher-level features contain more semantic and positioning information, so weighted averaging of multiscale side outputs in detection tasks tends to give better test results. Multiscale side output supervision was initially widely used in areas such as edge detection [32]. Hou et al. [33] obtained the improved DSS architecture based on the HED architecture by combining a specific short-connection structure with side outputs of different scales and successfully applied it to saliency detection. However, in DSS, since each side output layer is compressed into a single channel before making a short connection, there may be significant information loss between the short connections [34]. From the perspective of improving the efficiency of feature reuse, we do not fully adopt the short-connection architecture of DSS but instead attempt a more complete concatenation of the side outputs.

The backbone of the DMS network uses a processing approach similar to the classic HED [32] architecture, taking five side output layers from different depths of the backbone network to ensure the multiscale characteristics of the side outputs. We observe that previous studies did not discuss the specific meaning of the five-layer side output architecture, simply treating it as different information aggregations provided by low-level and high-level features. To make the meaning of the side output layers more typical, we aggregate side output layers 0 and 1 into the low-level feature (LF), side output layer 2 into the middle-level feature (MF), and side output layers 3 and 4 into the high-level feature (HF), as shown in Figure 2. Since the shallower feature layer has a larger size, we upsample the relatively deep feature layer to the same size and then join them as follows:

$$\mathrm{LF} = \mathrm{Conv}(\mathrm{Cat}(S_0, \mathrm{Up}(S_1))), \quad \mathrm{MF} = \mathrm{Conv}(S_2), \quad \mathrm{HF} = \mathrm{Conv}(\mathrm{Cat}(S_3, \mathrm{Up}(S_4))),$$

where LF, MF, and HF, respectively, represent the low-level, middle-level, and high-level features; $S_n$ represents the $n$-th-layer side output; $\mathrm{Up}(\cdot)$ represents feature layer upsampling, so that in the low-level feature $S_1$ is upsampled to the same scale as $S_0$, and in the high-level feature $S_4$ is upsampled to the same scale as $S_3$; $\mathrm{Cat}(\cdot)$ represents the concatenation of feature layers; and $\mathrm{Conv}(\cdot)$ represents a set of convolutional layers used to aggregate feature layers. On the one hand, reducing the number of feature layers effectively reduces the computational cost of side output stitching; on the other hand, the side output layers become more interpretable, which helps us further discuss the associations between them.
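As a concrete illustration, the following PyTorch sketch implements this three-level aggregation. The channel widths and `out_ch` are placeholder assumptions; the actual filter parameters are those given in Table 1.

```python
# A minimal sketch of the LF/MF/HF aggregation, assuming side outputs
# s0..s4 are NCHW tensors whose spatial size halves from one level to the next.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LevelAggregation(nn.Module):
    def __init__(self, chans=(64, 256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.conv_lf = nn.Conv2d(chans[0] + chans[1], out_ch, 3, padding=1)
        self.conv_mf = nn.Conv2d(chans[2], out_ch, 3, padding=1)
        self.conv_hf = nn.Conv2d(chans[3] + chans[4], out_ch, 3, padding=1)

    def forward(self, s0, s1, s2, s3, s4):
        up = lambda x, ref: F.interpolate(
            x, size=ref.shape[2:], mode="bilinear", align_corners=False)
        lf = self.conv_lf(torch.cat([s0, up(s1, s0)], dim=1))  # LF = Conv(Cat(S0, Up(S1)))
        mf = self.conv_mf(s2)                                  # MF = Conv(S2)
        hf = self.conv_hf(torch.cat([s3, up(s4, s3)], dim=1))  # HF = Conv(Cat(S3, Up(S4)))
        return lf, mf, hf
```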

In the SDMS network, we start from the high-level feature HF as $F_1$; after a set of transposed convolutions (so that the feature layer has the same scale as the next layer), it is upsampled to obtain the primary saliency map $Y_1$. We then aggregate the transposed-convolved high-level feature layer HF and the middle-level feature layer MF into $F_2$. Next, drawing on the idea of residual refinement [24], we stitch the primary saliency map $Y_1$ and the feature layer $F_2$ into the residual layer $R_2$; the next saliency map $Y_2$ is obtained by refining the initial saliency map $Y_1$ with $R_2$. The specific definition is as follows:

$$F_1 = \mathrm{HF}, \quad Y_1 = \mathrm{Up}(\mathrm{TConv}(F_1)), \quad F_{i+1} = \mathrm{Cat}(\mathrm{TConv}(F_i), O_{i+1}), \quad R_{i+1} = \mathrm{Conv}(\mathrm{Cat}(F_{i+1}, Y_i)), \quad Y_{i+1} = Y_i + R_{i+1},$$

where $F_i$ denotes the $i$-th feature layer; $O_i$ represents the corresponding side output layer (HF, MF, or LF); $R_i$ represents the $i$-th residual layer (with a single channel); $Y_i$ represents the $i$-th saliency map; and $\mathrm{TConv}(\cdot)$ indicates the transposed convolution matching the size of the next-level feature layer.
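The sketch below shows one refinement step of this recursion, assuming NCHW tensors and that the side output spatially matches the transposed-convolved feature; layer widths are placeholders, not the paper's actual parameters.

```python
# Hypothetical single refinement step: fuse the upscaled feature with the
# next side output, predict a one-channel residual from the previous saliency
# map, and add it (Y_{i+1} = Y_i + R_{i+1}).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineStep(nn.Module):
    def __init__(self, feat_ch, side_ch, out_ch):
        super().__init__()
        self.tconv = nn.ConvTranspose2d(feat_ch, out_ch, 4, stride=2, padding=1)
        self.fuse = nn.Conv2d(out_ch + side_ch, out_ch, 3, padding=1)
        self.residual = nn.Conv2d(out_ch + 1, 1, 3, padding=1)  # R_i has 1 channel

    def forward(self, feat, side, sal_prev):
        feat = self.fuse(torch.cat([self.tconv(feat), side], dim=1))   # F_{i+1}
        sal_prev = F.interpolate(sal_prev, size=feat.shape[2:],
                                 mode="bilinear", align_corners=False)
        r = self.residual(torch.cat([feat, sal_prev], dim=1))          # R_{i+1}
        return feat, sal_prev + r                                      # Y_{i+1}
```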

When using the above formula, an obvious problem appears: after multilayer residual refinement, the continually superimposed feature layers generated for feature concatenation cause huge computational overhead, so it is difficult to build a deeper architecture with the SDMS network. To further optimize the saliency map, we propose a dual-stream architecture, the BDMS network, with top-down and bottom-up feature concatenation. Specifically, we retain the saliency map obtained after the top-down deeper and mixed supervision pass, reset the accumulated feature layer to the side output features, and then perform a second round of saliency map refinement from bottom to top following the same refinement formulation.

The practice of resetting the side output features effectively reduces the computational overhead of the neural network and provides greater scalability for the DMS network structure. In this paper, we choose the BDMS structure shown in Figure 3 as the final architecture and take the sixth-level saliency map as the final result. The experimental results show that the training time of the model is significantly reduced after resetting the side output features.
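For orientation, here is a conceptual sketch of the BDMS dual-stream pass reusing the hypothetical `RefineStep` above. The exact refinement schedule, the reset point, and the internal resampling are our reading of the text, not the authors' exact implementation.

```python
# Conceptual sketch only: top-down refinement, feature reset, then bottom-up
# refinement. `head` predicts the primary one-channel saliency map; each step
# is assumed to resample internally to its target side output scale.
def bdms_forward(hf, mf, lf, head, steps_down, steps_up):
    feat = hf
    sal = head(feat)                        # primary saliency map Y_1
    for step, side in zip(steps_down, (mf, lf)):
        feat, sal = step(feat, side, sal)   # top-down refinement
    feat = lf                               # reset the accumulated feature
    for step, side in zip(steps_up, (mf, hf)):
        feat, sal = step(feat, side, sal)   # bottom-up refinement
    return sal                              # final refined saliency map
```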

3.3. Weighted Loss Function

In order to satisfy the mixed supervision of classification results and saliency maps, we design a weighted mixed loss function for neural network training. We denote the open source data as $D = \{(X_m, G_m, c_m)\}_{m=1}^{M}$, where $X_m$ and $G_m$, respectively, represent an input image with $P$ pixels and its binarized ground truth map, and $c_m$ represents the classification label corresponding to the image. We design our weighted mixed loss function based on the cross-entropy loss. In particular, the weighted mixed loss function of the $n$-th output is as follows:

$$L^{(n)} = -\alpha \sum_{k=1}^{P} \left[ g_k \log y_k + (1 - g_k) \log (1 - y_k) \right] - \beta \sum_{j=1}^{N} \mathbb{1}(j = c) \log p_j,$$

where the saliency map and the classification prediction both come from the $n$-th output; $y_k$ represents the predicted value of the $k$-th pixel in the saliency map, with $P$ the total number of pixels; $g_k$ represents the ground truth value of the $k$-th pixel, where $g_k = 1$ denotes a foreground pixel and $g_k = 0$ a background pixel; $p_j$ represents the predicted probability that the image belongs to the $j$-th of the $N$ categories; $c$ represents the true category of the image; and $\alpha$ and $\beta$, respectively, represent the weights of the foreground segmentation part and the classification part. In our experiments, we set $\alpha$ and $\beta$ to 1 and 0.01, respectively, to balance the segmentation and classification terms of the loss function.

The above formula gives the loss of the $n$-th-level output. For the entire neural network, the complete loss function is defined as the weighted sum of the losses of all stage outputs:

$$L = \sum_{n=1}^{N_s} w_n L^{(n)},$$

where $w_n$ and $L^{(n)}$ represent the weight and the loss function of the $n$-th stage output, and $N_s$ depends on the number of feature concatenation levels, taking the value 6 in the BDMS network. In the experiments of this paper, we do not discuss the weight of each feature layer, so $w_n$ is set to 1 uniformly.
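A minimal sketch of this loss under the stated settings ($\alpha = 1$, $\beta = 0.01$, $w_n = 1$) follows. The BCE-with-logits formulation, the tensor shapes, and the `mixed_loss`/`total_loss` names are our assumptions for illustration.

```python
# Hypothetical implementation of the weighted mixed loss.
import torch
import torch.nn.functional as F

def mixed_loss(sal_logits, gt_mask, cls_logits, cls_label,
               alpha=1.0, beta=0.01):
    """Per-stage loss: pixel-wise BCE plus weighted classification CE."""
    seg = F.binary_cross_entropy_with_logits(sal_logits, gt_mask)
    cls = F.cross_entropy(cls_logits, cls_label)
    return alpha * seg + beta * cls

def total_loss(stage_outputs, gt_mask, cls_label, weights=None):
    """Weighted sum over the stage outputs (six stages in BDMS, w_n = 1)."""
    weights = weights or [1.0] * len(stage_outputs)
    return sum(w * mixed_loss(s, gt_mask, c, cls_label)
               for w, (s, c) in zip(weights, stage_outputs))
```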

4. Experiments

In this section, we describe the experimental parameters and the training and testing details of the model, introduce the ASI datasets we used, and present the experimental results.

4.1. Training and Inference Settings
4.1.1. Training Parameters

We implement the model based on PyTorch 0.4.0. All models were trained and tested on an NVIDIA GeForce GTX Titan Xp GPU (12 GB memory). As the number of saliency detection datasets, especially ASI datasets, is usually limited, a pretrained backbone network is necessary. In this paper, we use the weights of the ImageNet-pretrained ResNeXt101 [35] network as the initial parameters of the backbone network. We train with standard stochastic gradient descent (SGD), with a batch size of 16, momentum of 0.9, and weight decay of 0.0005. The initial learning rate is set to 0.001, with polynomial decay of power 0.9. Training finishes after 20000 iterations, and we save the best and the latest models.
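The sketch below reproduces this training setup, assuming a `model` and a `loader` yielding (images, masks, labels) batches of size 16, plus the `total_loss` sketch from Section 3.3. The poly schedule is written with `LambdaLR` for readability, whereas the original work used PyTorch 0.4.0.

```python
# Hypothetical training loop matching the stated hyperparameters.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)
max_iters = 20000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / max_iters) ** 0.9)  # poly, power 0.9

for it, (images, masks, labels) in enumerate(loader):  # loader assumed to cycle
    optimizer.zero_grad()
    stage_outputs = model(images)          # list of (sal_logits, cls_logits)
    loss = total_loss(stage_outputs, masks, labels)
    loss.backward()
    optimizer.step()
    scheduler.step()
    if it + 1 == max_iters:
        break
```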

SOD models are usually trained on the MSRA10K [36] dataset and verified on other datasets. Considering the limited size of ASI datasets and the particularity of mixed supervision, we instead split each ASI dataset into its own training and test sets. The ASI datasets are introduced in Section 4.3.

4.1.2. Inference

During the testing stage, we input the test image into the trained network and obtain the saliency map and classification result of each side output, without any other preprocessing or postprocessing. While generating the inference, we evaluate the classification accuracy of the image and the $F_\beta$ and MAE values of the saliency map and save the relatively better results. In general, the multilevel optimized saliency map has better metrics.

4.2. Evaluation Metrics

In the classification task, we use classification accuracy to evaluate the model. For saliency evaluation, we use two commonly used metrics, the F measure ($F_\beta$) and the mean absolute error (MAE), to evaluate our DMS network. A good saliency network usually has a larger $F_\beta$ and a smaller MAE. For a saliency map $y$ with $P$ pixels, we linearly map the pixel values from [0, 255] to [0, 1] and compare it with the ground truth map $Y$. The MAE is calculated as follows:

$$\mathrm{MAE} = \frac{1}{P} \sum_{k=1}^{P} \left| y_k - Y_k \right|.$$

And the F measure is calculated as

$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}.$$

Usually, $\beta^2 = 0.3$ is chosen to emphasize the precision of the saliency maps. These metrics are discussed in our experimental results below.
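For reference, here is a minimal sketch of the two metrics, assuming the saliency map `y` and ground truth `Y` are float arrays already scaled to [0, 1]; the fixed binarization threshold is an illustrative simplification (adaptive thresholds are also common in the SOD literature).

```python
# Hypothetical metric implementations for a single saliency map.
import numpy as np

def mae(y: np.ndarray, Y: np.ndarray) -> float:
    """Mean absolute error over all pixels."""
    return float(np.abs(y - Y).mean())

def f_measure(y: np.ndarray, Y: np.ndarray,
              beta2: float = 0.3, thresh: float = 0.5) -> float:
    """F measure with beta^2 = 0.3, using a fixed binarization threshold."""
    pred, gt = y >= thresh, Y >= 0.5
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```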

4.3. ASI Datasets

In order to verify the effectiveness of the proposed automated surface defect detection model, we select four public surface defect datasets for experiments: the magnetic tile surface defect dataset [16], the NEU surface defect dataset [37], the rail surface defect dataset [38], and the road crack dataset [39].

4.3.1. Magnetic Tile Surface Defect Dataset

The first dataset is the magnetic tile surface defect dataset (MTDD) [16]. MTDD contains 1344 images divided into six categories, including five defect types and one defect-free type, named Blowhole, Break, Crack, Fray, Uneven, and Free. Examples of different defect images and ground truth are shown in Figure 4. We divide this dataset into a training set (1118 images) and a test set (226 images), which share the same category distribution.

4.3.2. NEU Surface Defect Database

The second open source dataset is the NEU strip surface defect dataset [37]. The NEU dataset includes six defect types of the hot-rolled strip surface, namely crazing (Cr), inclusion (In), patches (Pa), pitted surface (PS), rolled-in scale (RS), and scratches (Sc), with 300 images each. Examples of defect images are shown in Figure 5.

4.3.3. Rail Surface Defect Datasets

The third open source dataset is the rail surface defect datasets (RSDDs) [38]. It contains two subsets: the Type-I RSDDs dataset, captured from express rails, with 67 challenging images, and the Type-II RSDDs dataset, captured from common/heavy haul rails, with 128 challenging images. Examples of defect images are shown in Figure 6.

4.3.4. Road Crack Dataset

The fourth open source dataset is the road crack dataset [39]. The road crack dataset does not classify road cracks but provides road images and pixel-level labels, 151 images in total. Examples of defect images are shown in Figure 7.

4.4. Ablation Analysis

In order to fully evaluate the performance of the mixed supervision and feature concatenation mechanisms on the ASI datasets, we perform an ablation analysis of the DMS network. We use the same backbone network and hyperparameters in a series of simplified SDMS networks and compare them with the standard SDMS network to verify the effectiveness of side output concatenation and the mixed supervision mechanism. The specific settings of the ablation models are shown in Table 2, and the results of the different models are shown in Table 3.

Table 3 shows the test results for the different settings. We can draw the following conclusions from the experiment: (1) The accuracy of the saliency maps in each side output layer of the standard SDMS network increases layer by layer, indicating the effectiveness of the residual refinement mechanism. (2) The comparison between the standard SDMS and SDMS-A networks proves that side output concatenation effectively improves the accuracy of salient defect detection. (3) The comparison between the standard SDMS and SDMS-B networks shows that the mixed supervision mechanism significantly improves detection accuracy, confirming that image-level labels can effectively enhance saliency segmentation results. (4) The comparison between the standard SDMS and SDMS-C networks shows that introducing the ASPP mechanism does not enhance network effectiveness while increasing the training time of the DMS network by about 40%; we therefore abandoned the ASPP module. (5) The comparison between the standard SDMS and SDMS-D indicates that feature concatenation is a more effective way to reuse feature layers than skip connections. (6) The standard SDMS performs better overall than SDMS-E, even though SDMS-E achieves a higher F measure score, indicating that residual refinement is an effective mechanism for optimizing the saliency maps.

4.5. Model Comparison

We compare the effectiveness of the DMS model with 20 saliency methods, including 12 traditional saliency algorithms (ITTI [40], LC [41], SR [42], AC [43], FT [44], MSS [45], PHOT [46], HC [47], RC [47], SF [48], BMS [49], and MBP [50]) and 8 deep learning methods (U-Net [51], FCN [19], R3Net [24], DSS [33], PiCANet [52], BASNet [29], PoolNet [53], and EGNet [26]). We implement the traditional saliency algorithms through the toolbox provided in [16]. In particular, we test the traditional saliency methods on the test dataset without the Free type, which would cause significant interference with the experimental results. For a fair comparison, we run the deep learning methods with the code directly provided by the authors. Table 4 shows the experimental results on the MTDD dataset. Without any preprocessing or postprocessing, the proposed method outperforms these state-of-the-art methods. Figure 8 provides several examples of different defects, where our method is obviously better than the others. Table 5 shows the test results on the other three datasets, which verify the effectiveness of the proposed method. In addition, our proposed method runs at about 7 FPS on GPU with input size 300 × 300.

We also test the effectiveness of the DMS network on the three other ASI datasets, RSDDs, road crack, and NEU. The test results show that the DMS network can mostly fulfill the foreground segmentation of surface defects, but there is still room for improvement in segmentation accuracy on small defects such as cracks.

5. Conclusion

This paper presents an optimized deeper and mixed supervision network for surface defect saliency detection. The network improves on the basic HED architecture and is equipped with a layer-by-layer feature concatenation structure in the side output network. We design our loss function and add a classification module to DMS in order to jointly train classification and saliency segmentation in one stage. In the side output network, we divide the side outputs into high-level, middle-level, and low-level features and realize feature reuse while preserving feature layer information maximally. In addition, we generate saliency maps along each feature layer and apply supervision signals, and the supervised map is passed to the next feature layer to achieve residual refinement of the saliency map.

One of our key contributions is the proposal of a joint training mechanism for classification and saliency segmentation. We implement image classification plus segmentation in one model, and the classification information effectively enhances the saliency detection accuracy on ASI datasets. In particular, it has a remarkable effect on filtering out normal samples (nonsalient images) with no defects in practical applications. We believe that such a multitask model is a useful idea for promoting saliency detection in more practical scenes. We conducted ablation tests and evaluated our DMS network on four different test sets. The results show that the improvement mechanisms we propose increase the effectiveness of saliency detection to different extents and can be effectively extended to other ASI datasets.

We tried to employ the atrous spatial pyramid pooling (ASPP) module from DeepLabV3 in the DMS network, which might improve model effectiveness by expanding the receptive field of the convolution layers. However, the experimental results show that this operation does not have a positive effect on the task and instead degrades some evaluation metrics. Our analysis suggests that most detection objects in ASI saliency datasets are small targets, so expanding the receptive field to collect global information is not very effective; that is to say, local information plays a more active role in saliency-based surface defect detection. At present, saliency detection of small targets is still a recognized difficulty in the field of saliency detection. Therefore, exploring how to solve the saliency segmentation of small-scale targets may be one of the keys to further improving ASI performance.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request. The saliency maps used to support the findings of this study have been deposited in the GitHub repository (https://github.com/Sssssbo/DMS). Previously reported ASI datasets were used to support this study and are available at https://github.com/abin24/Magnetic-tile-defect-datasets, http://faculty.neu.edu.cn/me/songkc/Vision-based_SIS_Steel.html, and https://github.com/cuilimeng/CrackForest-dataset. These prior studies (and datasets) are cited at relevant places within the text as references [16, 37, 38, 39].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 51375439).