Abstract

The intelligent imaging sensors in IoT benefit greatly from the continuous renewal of deep neural networks (DNNs). However, the appearance of adversarial examples leads to skepticism about the trustworthiness of DNNs. Malicious perturbations, even unperceivable for humans, can incapacitate a DNN, raising security problems for the information integration of an IoT system. Adversarial example detection is an intuitive solution that judges whether an input is malicious before accepting it. However, the existing detection approaches, more or less, have shortcomings such as (1) modifying the network structure, (2) requiring extra training before deployment, and (3) requiring prior knowledge about attacks. To address these problems, this paper proposes a novel framework that filters out adversarial perturbations by superimposing the original images with noise decorated by a new gradient-independent visualization method, namely, the score class activation map (Score-CAM). We propose to trim the Gaussian noise in a way with more explicit semantic meaning and stronger explainability, which differs from previous studies based on intuitive hypotheses or artificial denoisers. Our framework requires no extra training and no gradient calculation, which is friendly to embedded devices with only inference capabilities. Extensive experiments demonstrate that the proposed framework is sufficiently general to detect a wide range of attacks and to apply to different models.

1. Introduction

The continuous upgrading of DNNs provides an opportunity to efficiently process the enormous unstructured data generated by the wide-spreading imaging sensors in IoT systems [1, 2]. However, recent studies [3–5] have shown that deep neural networks (DNNs) are vulnerable to adversarial attacks, which apply subtle and unperceivable perturbations to input examples and can completely fool a deep learning model. Depending on the attack setting, adversarial attacks come in various types, such as white-box attacks [6] and black-box attacks [7]. There are also attacks targeting different application scenarios, such as face recognition [8] and natural language processing [9]. Such attacks seriously threaten the success of deep learning in practice. Defending against adversarial examples is now an important and pressing problem.

According to the manipulated objects, we divide the mainstream defense methods into three categories: (1) enhancing the robustness of deep learning models by modifying the model itself, (2) detecting adversarial examples with independent widgets, and (3) removing the perturbations in adversarial examples directly. Adversarial training [3, 10–12] is now the state-of-the-art approach aimed at enhancing the robustness of deep models. This method works well when prior knowledge about the attacks is available yet could fail when facing unknown attacks. Moreover, attackers could deliberately design examples targeting the enhanced models [13, 14].

Some studies are aimed at detecting whether an example is adversarial before accepting its prediction label. For example, Tao et al. hypothesize that DNNs should rely on human-perceivable attributes alone to make decisions. Even if invisible attributes play a key role in boosting a DNN's accuracy, they are vulnerable to hostile attacks. They propose an attribute-steered model based only on human-perceptible attributes and utilize the prediction inconsistency between the proposed model and the original one to detect adversarial examples [15]. The authors of NIC [16] regard the detection problem as an anomaly detection problem. They use clean examples to train a one-class support vector machine (OSVM) to detect adversarial examples. However, both detection approaches need to modify the network structure and retrain the model.

Other studies propose to build denoisers to deal with adversarial attacks. A denoiser could filter out the adversarial noise and work as a robustness-enhancing component for the original deep model. But more often, this approach is directly employed as an adversarial example detector. Feature squeezing [17] is an intuitive denoising method that squeezes the feature space of input images, but its performance depends heavily on the quality of the designed squeezing method. MagNet [18] and HGD [19] propose to train denoisers composed of an encoder and a decoder to remove the substantial adversarial noise in the pictures. Nevertheless, this kind of method may reduce the quality of the input pictures, which lowers the accuracy of deep learning models. Training is still another tricky problem: training a reconstruction network is a skillful and time-consuming task, especially for images with high resolution.

The development of explainable artificial intelligence (XAI) [20, 21] provides an opportunity to reconsider the problem of adversarial examples. Class activation map (CAM) technology, a visualization method for DNN interpretability, has achieved some positive results [22–26]. Whatever the attack method, its essential purpose is to divert the model's decision-making attention by adding disturbance to the input, leading to wrong predictions. Since the attack changes the provenance of the model's decision-making, the visualized interpretation of adversarial examples must differ, more or less, from that of the normal ones. Therefore, this deviation, or a derivation thereof, could be the critical information for spotting malicious examples. Wang and Gong use the features extracted from multilayer saliency maps to train a binary classifier for discerning adversarial pictures [27]. This route requires prior knowledge of attacks, like adversarial training. Ye et al. propose to directly superimpose the Grad-CAM onto the original image in a specific ratio to mitigate the adversarial perturbations [28]. Yet, the direct addition of Grad-CAM and the original image essentially shifts the mean of the pixels and changes the brightness of the pictures, resulting in an unnecessary loss of accuracy. Moreover, Grad-CAM itself also has problems such as false confidence and the need for a back-propagation interface (see the detailed discussion in Section 3.1.2).

This paper takes the interpretable visualization as an efficient representation of the deviation between adversarial examples and benign ones and proposes a novel framework for adversarial example detection. Based on our analysis of the influence of malicious examples on the target model, Gaussian white noise is decorated by the CAM to generate a mask, which is then superimposed on the original image to denoise the adversarial perturbations. Compared to the state-of-the-art XAI-based denoiser in [28], a more logical and reasonable method is employed to generate the mask. Besides, a superior CAM, namely, Score-CAM, is utilized to capture the target model's attention more accurately and to guide the decoration of the Gaussian noise. Overall, the advantages of the proposed framework can be summarized as follows:
(1) Based on a derivation with explicit semantic meaning, we directly use random white noise decorated by Score-CAM to eliminate adversarial features, making the proposed framework more explainable.
(2) Only inference is needed to compute the Score-CAM, independent of computation-intensive back-propagation, making the proposed framework friendly to deployment environments such as intelligent imaging sensors.
(3) Since the detection results are determined by the prediction inconsistency before and after denoising, the framework can work as an independent component without modifying the original DNN structure or requiring extra training.
(4) The proposed framework is inspired by the common characteristics of various adversarial attacks. It applies to different attacks without extra data or prior knowledge about attacks, which lowers deployment costs and broadens the applicable scenarios.

Extensive experiments are conducted over several representative attack algorithms toward different DNN models. The experimental results show that our approach can always achieve the highest prediction accuracy and detection success rate. The potential of applying XAI to solve complex adversarial example detection problems is exhibited.

The remainder of this paper is organized as follows. Section 2 introduces the related works about deep learning interpretability and adversarial example detection. In Section 3, the idea to design the detection framework is discussed, and then, the details of the proposed method are brought out. Experiments in Section 4 verify the effectiveness of the framework. Finally, Section 5 presents our conclusions and prospects.

2. Related Works

2.1. Interpretability of Deep Learning

Deep learning has achieved great success in many fields [2, 29]. Nevertheless, the end-to-end learning method, which optimizes a large number of parameters through the back-propagation of losses, is similar to a “black box.” It means that deep learning models lack transparency and interpretability. This is a significant drawback in many applications, where the rationale of models’ decisions is a requirement for trust. Although we have built algorithms with extremely high accuracy, we can only get model parameters with unclear meaning in the end. In other words, the deep model itself contains knowledge, but humans cannot understand it. We want to know (in our way) what knowledge the model has learned from the data to make the final decision. Hence, the interpretability of deep learning is of great significance to artificial intelligence. On the one hand, it is an essential means to evaluate the safety of artificial intelligence. On the other hand, it is also conducive to accelerating the promotion of artificial intelligence applications.

Zhou et al. proposed CAM [30], one of the most representative interpretability approaches. CAM is essentially a heat map that depicts the attention information of deep learning models. They found that the weights of the classification layer, i.e., the fully connected (FC) layer after the global average pooling (GAP) layer, were highly correlated with the corresponding categories. Therefore, they propose to use the information contained in the GAP-based structure to derive CAM. In their definition, CAM is the linear weighted sum of the activation maps. For example, consider the structure in which an FC layer follows a GAP layer. Let $A^k$ denote the $k$-th channel of the activations fed into the GAP layer. $\mathbf{w}^c$ denotes the weight vector of the last FC layer with respect to class $c$, and its $k$-th element is represented by $w_k^c$. The CAM of class $c$ is defined as

$$L_{\mathrm{CAM}}^{c} = \sum_{k} w_{k}^{c} A^{k}, \qquad (1)$$

where the prediction score of class $c$ produced by the GAP-FC structure is

$$Y^{c} = \sum_{k} w_{k}^{c} \frac{1}{Z} \sum_{i,j} A^{k}(i,j), \qquad (2)$$

with $Z$ the number of spatial locations in $A^{k}$.
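To make the GAP-FC computation concrete, the following minimal sketch computes a CAM from the last convolutional activations and the FC weights of a GAP-based classifier. The array names and the min-max normalization are illustrative assumptions added here, not part of the original definition.

```python
import numpy as np

def compute_cam(activations: np.ndarray, fc_weights: np.ndarray, class_idx: int) -> np.ndarray:
    """activations: (H, W, K) output of the last conv layer for one image;
    fc_weights: (K, num_classes) kernel of the FC layer that follows GAP."""
    w_c = fc_weights[:, class_idx]                          # weights for the class of interest
    cam = np.tensordot(activations, w_c, axes=([2], [0]))   # weighted sum over channels -> (H, W)
    cam -= cam.min()                                        # min-max normalize for visualization
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```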

Based on the above definition, the calculation of CAM depends on the specific structure of the FC and GAP layers. Therefore, a deep model without a GAP layer needs to be modified and retrained. Moreover, the last convolution layer is generally of small spatial size, so the CAM must always be resized to the same shape as the input image, leading to coarse spatial information after interpolation.

Grad-CAM [22] generalizes CAM to models without GAP layers. The core idea of Grad-CAM is to represent the fusion weights, $\alpha_{k}^{c}$, by gradients. Since the calculation of the gradient is independent of GAP layers, Grad-CAM is applicable to any layer. Consider a convolutional layer $l$ and a class of interest $c$. The prediction probability of class $c$ is denoted as $Y^{c}$. Let $A_{l}$ denote the activations of layer $l$, while $A_{l}^{k}$ is the $k$-th channel. The spatial shape of $A_{l}^{k}$ is $W_{l} \times H_{l}$, where $W_{l}$ and $H_{l}$ are, respectively, the width and height of the activations of the $l$-th layer in the model. The Grad-CAM, denoted as $L_{\mathrm{Grad\text{-}CAM}}^{c}$, is defined as

$$L_{\mathrm{Grad\text{-}CAM}}^{c} = \mathrm{ReLU}\Big(\sum_{k} \alpha_{k}^{c} A_{l}^{k}\Big), \qquad (3)$$

where

$$\alpha_{k}^{c} = \frac{1}{W_{l} H_{l}} \sum_{i=1}^{W_{l}} \sum_{j=1}^{H_{l}} \frac{\partial Y^{c}}{\partial A_{l}^{k}(i,j)}. \qquad (4)$$

The fusion weights $\alpha_{k}^{c}$ are defined by the spatial average of the element-wise partial derivatives of $Y^{c}$ with respect to $A_{l}^{k}$. The $\mathrm{ReLU}$ function is adopted to remove the negative values. Grad-CAM is applicable not only to classification problems but also to other models whose output is differentiable with respect to the activations.
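As an illustration of formulas (3) and (4), the sketch below computes Grad-CAM for a Keras classifier with TensorFlow's automatic differentiation. `model` and `layer_name` are placeholders, and this is a minimal sketch rather than the reference implementation of [22].

```python
import numpy as np
import tensorflow as tf

def grad_cam(model: tf.keras.Model, image: np.ndarray, layer_name: str, class_idx: int) -> np.ndarray:
    # Expose both the chosen layer's activations and the final predictions.
    grad_model = tf.keras.models.Model(
        inputs=model.input,
        outputs=[model.get_layer(layer_name).output, model.output],
    )
    x = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        activations, predictions = grad_model(x)
        score = predictions[:, class_idx]                 # Y^c
    grads = tape.gradient(score, activations)             # dY^c / dA_l
    alpha = tf.reduce_mean(grads, axis=(1, 2))            # formula (4): spatial average of gradients
    cam = tf.nn.relu(tf.reduce_sum(activations * alpha[:, None, None, :], axis=-1))  # formula (3)
    return cam.numpy()[0]                                 # heat map at the layer's spatial resolution
```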

2.2. Adversarial Example Detection

The goal of adversarial example detection is to judge if an input image is malicious. On the basis of whether to modify the input examples, existing works can be divided into two categories: (1) statistics-based and (2) denoiser-based.

2.2.1. Statistics-Based Approaches

Adversarial examples are aimed at distorting the output of their target model. Since a model makes divergent decisions with high probability when facing a malicious example, there must be a statistical difference, in the example itself or in the process of decision-making, between adversarial examples and the corresponding benign ones. The main idea of statistics-based approaches is to design measurable metrics for the statistical differences between adversarial and benign examples and to make them as significant as possible. However, this kind of method needs some prior information, more or less. Nicolae et al. find that adversarial examples have larger reconstruction errors compared to clean ones [31]. They take advantage of CapsNet [32] to reconstruct the input image and train it with a reconstruction loss between the input and the reconstructed image. After training, the reconstruction errors of most adversarial examples are larger than a threshold. This method works on MNIST, Fashion MNIST, and SVHN. However, for examples with a low distortion level, its result is not satisfactory. NIC [16] treats the detection task as a one-class classification (OCC) problem. It utilizes a one-class support vector machine (OSVM) model to classify the input images. Additional classification layers connecting to the internal layers of the original model are trained first to extract extra features, with which the OSVM can be learned. This method only needs benign examples for training, requiring no information about attack algorithms.

2.2.2. Denoiser-Based Approaches

The basic idea of denoiser-based approaches is to filter out the possible adversarial noise in the image without destroying the original semantic information. MagNet [18] uses a reconstruction network to detect adversarial examples, which is similar to [33]. The difference is that the reconstruction network is a combination of an encoder and a decoder. After training the reconstruction network, the reconstructed image and the original image are simultaneously fed into the target model. Then, the Jensen-Shannon divergence (JSD) between the prediction logits of the two images is calculated. If the JSD goes beyond a certain threshold, the input image is considered an adversarial example. The experimental results of MagNet performed well on small sample size datasets, such as MNIST and CIFAR-10. However, MagNet was later found to fail on large-scale datasets such as ImageNet [34]. Liao et al. propose a high-level representation guided denoiser (HGD) [19] for large images and achieve state-of-the-art results on ImageNet. Xu et al. propose the state-of-the-art denoiser feature squeezing [17]. The authors consider that the oversized input feature space is redundant for image classification. They propose to squeeze the feature space to reduce unnecessary information. Three methods are employed for denoising: squeezing color bits, median smoothing, and nonlocal smoothing. We believe that the essence of feature squeezing is to disrupt the original pixel distribution with minimal destruction of the original semantic information. However, the performance depends on artificially designed filters. Ye et al. propose a detection framework [28] based on Grad-CAM [22]. In [28], the Grad-CAM of the input image is superimposed onto the input image itself with a particular ratio to generate an emphasized image $x_{\mathrm{em}}$:

$$x_{\mathrm{em}} = x + \lambda \cdot L_{\mathrm{Grad\text{-}CAM}}^{c}. \qquad (5)$$

$x$ represents the input image and $\lambda$ is a hyperparameter. $L_{\mathrm{Grad\text{-}CAM}}^{c}$ is calculated by formula (3).

Then, the original input image $x$ and the emphasized image $x_{\mathrm{em}}$ are simultaneously fed into the same deep model to compare their prediction labels. If the prediction results are not the same, the original input image is considered malicious.
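A minimal sketch of this baseline detector is given below. It reuses the `grad_cam` sketch above; the normalization of the heat map to the pixel range and the default value of `lam` (the ratio λ in formula (5)) are our assumptions rather than the exact settings of [28].

```python
import numpy as np
import tensorflow as tf

def detect_with_emphasized_image(model, image, layer_name, lam=1.5):
    """Grad-CAM+EI baseline: flag the input if the label changes after emphasis."""
    pred = model(image[np.newaxis, ...].astype("float32"))
    cls = int(np.argmax(pred))
    cam = grad_cam(model, image, layer_name, cls)                       # (h, w), un-normalized
    cam = tf.image.resize(cam[..., None], image.shape[:2]).numpy()[..., 0]
    cam = cam / (cam.max() + 1e-8)                                      # scale to [0, 1] (our assumption)
    emphasized = np.clip(image + lam * 255.0 * cam[..., None], 0.0, 255.0)  # formula (5), pixel range 0..255
    pred_emp = model(emphasized[np.newaxis, ...].astype("float32"))
    return int(np.argmax(pred_emp)) != cls                              # True -> flagged as adversarial
```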

3. Adversarial Example Detection

In this section, we first explore the problem of deep learning models from the perspective of adversarial attack and defense. Based on this discussion, the design philosophy of our work is put forward. Then, the reasons why Score-CAM is chosen are discussed. Finally, the algorithm framework and its running procedure are described.

3.1. Design Philosophy
3.1.1. Denoising Motivation

Noise within a certain range usually does no harm to the performance of DNNs. There are already mature techniques to enhance the robustness of DNN models, including data augmentation, transfer learning, and dropout. A DNN model can be trained to work well in various scenarios with different noise levels.

However, the situation is different for adversarial examples. One of the primary principles of attack methodology is to impact the final output as much as possible with the slightest change to the input. The level of adversarial distortion accumulates along with the depth, which has been shown by some previous studies [1, 17, 24, 26]. Slight as the malicious perturbation is, adversarial examples are sensitive to even low-level random noise.

Based on the above discussion, an instinctive idea is to cover the perturbations with random noise. However, superimposing random noise of uniform intensity over the whole image may cause unnecessary information loss. Therefore, adding appropriate noise with the slightest effect on the benign examples' accuracy becomes the key to the problem. CAM provides a superior solution to this problem.

CAM is designed for deep learning interpretability, making it a suitable tool to reflect the internal activation state. It reveals the internal information of deep models through visualization. Given an input image and a class of interest, CAM draws a heat map that indicates the contribution of each area (in the input image) to the prediction score. In other words, it reveals the spatial activation level of a chosen layer. For an unsoiled picture, CAM correctly displays the activation state w.r.t. the ground truth label. For a malignant image that steers the target model toward a wrong classification decision, the deliberate alteration changes the neurons' activation mode in the target model. Based on the mistakenly predicted class, CAM will capture the abnormal activation of neurons and express it through the heat map. Figure 1 shows the juxtapositions of CAMs from some benign examples and their corresponding adversarial examples. The visualization results before and after being polluted by adversarial perturbations are displayed in three groups: input images (Figures 1(a) and 1(b)), Grad-CAM (Figures 1(c) and 1(d)), and Score-CAM (Figures 1(e) and 1(f)). It demonstrates that, from the perspective of either Grad-CAM or Score-CAM, the model's areas of interest are manipulated by the unperceivable modifications in the input. For example, Figures 1(c) and 1(e) show that the attention of the model is on the area of the main objects (a boy in a go-kart) when the input is an original image. But adversarial noise switches the hot zone to the background (Figures 1(d) and 1(f)). Hence, we can exploit the difference in the intermediate information between clean and adversarial examples to trim the random noise imposed on the detected examples.

We propose to denoise the adversarial perturbations by superimposing the input image with random noise weighted by the CAM in the spatial dimension. More specifically, a random Gaussian noise matrix of the same size as the input example is first generated. Afterward, the noise is dot-multiplied with the CAM. Finally, the input image (not necessarily clean) is covered by the noise edited by the CAM. Consequently, a region with a higher activation level is embedded with higher-level noise after the above transformation. For benign examples, noise covering the area of interest, where the most potential features are located, may lead to a partial loss of information. Nevertheless, the primary semantic information cannot be wrecked if the noise is controlled to a certain level, and the model can still take advantage of the information in the denoised image to make decisions. In contrast, if the input is a poisoned example, the predicted class differs from the ground truth label. CAM will draw the heat map based on the wrong class, where the activated area is different from the area with the wealthiest semantic information. Hence, the edited noise trimmed by the heat map only slightly affects the original area with semantic information, but the distribution of the adversarial perturbations could be distorted much more severely, since they are deliberately designed to be as small as possible.

The first row in Figure 1 provides a more intuitive example of our motivation. As depicted in Figure 1(e), Score-CAM accurately sketches the bird's contour, which contains the wealthiest semantic information. If the input is a clean example (Figure 1(a)), the random noise will cover the bird's area in line with the result in Figure 1(e). Based on the previous discussion, the decorated noise only hinders the classification slightly. In contrast, for the adversarial example (Figure 1(b)), the attack algorithm switches the highlighted zone to the pixels of the grassland (shown in Figure 1(f)). When this contaminated example is fed into our framework, the background region is filled with emphasized Gaussian noise. Nevertheless, the principal entity, the bird, is barely influenced because we trim the noise's value according to the Score-CAM.

Section 3.2.1 gives a more detailed description of the proposed detection framework.

3.1.2. Choice of CAM

Most methods for extracting CAM are based on gradients. However, gradient-based methods have flaws that hinder them from revealing the real attention of DNN models. First, for a DNN model with dozens of layers, gradient vanishing caused by activation functions cannot be ignored. For example, there is an inconsistency of the gradient caused by the flat zero-gradient interval in the ReLU function, one of the most used activation functions. This inconsistency could bring about high-frequency spatial noise when computing the output gradient for an internal activation map. Second, the gradient is likely to convey false confidence due to gradient saturation: the area highlighted by the gradient does not always contribute proportional confidence to the result. This phenomenon was discovered in [26]. Last but not least, most real-world deployment environments, e.g., edge computing environments [35], cannot support the gradient computation of deep models. Moreover, neural network quantization is also widely utilized for deep model deployment, resulting in higher complexity and larger errors in computing gradients. The above facts mean that gradient-based techniques, like Grad-CAM, are not universally applicable.

Score-CAM [26] adopts a gradient-free method to design the fusion weights, i.e., $\alpha_{k}^{c}$. It introduces the concept of channel-wise increase of confidence (CIC) to measure the importance of the activation map in each channel. It utilizes the image masked by the activation in each channel to compute CIC. The linear sum of activations weighted by CIC is further calculated. Given a DNN model and a class of interest $c$, the function $f^{c}(\cdot)$ takes an image and outputs a scalar that represents the output probability for class $c$. Let $A_{l}$ denote the activation of the $l$-th convolutional layer, and $A_{l}^{k}$ denotes the $k$-th channel of $A_{l}$. The Score-CAM of class $c$ is formulated in two steps:

(1) Computing CIC.

Considering a known baseline input $x_{b}$, the contribution of $A_{l}^{k}$ towards the prediction of class $c$ is defined as

$$C(A_{l}^{k}) = f^{c}(x \circ H_{l}^{k}) - f^{c}(x_{b}), \qquad (6)$$

where

$$H_{l}^{k} = s\big(\mathrm{Up}(A_{l}^{k})\big). \qquad (7)$$

In this paper, $x_{b}$ is a zero matrix with the same size as $x$. $\mathrm{Up}(\cdot)$ denotes upsampling $A_{l}^{k}$ to the same spatial size as the original input $x$. $s(\cdot)$ is a min-max normalization function that limits the raw activation values to $[0, 1]$.

(2) Computing Score-CAM.

In the process of calculating Score-CAM, $C(A_{l}^{k})$ serves as the fusion weight for the $k$-th channel, and $x \circ H_{l}^{k}$ is the image masked by the $k$-th channel. By applying the weights to the original activation maps, we can get the Score-CAM:

$$L_{\mathrm{Score\text{-}CAM}}^{c} = \mathrm{ReLU}\Big(\sum_{k} C(A_{l}^{k})\, A_{l}^{k}\Big). \qquad (8)$$

The $\mathrm{ReLU}$ function is used to remove the disturbance of irrelevant pixels on the activation map.
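The following sketch computes Score-CAM with forward passes only, following formulas (6)–(8) with a zero baseline. `act_model` is assumed to be a Keras model that maps an input to the activations $A_l$ of the chosen layer (e.g., built from `model.get_layer(...)`); all names are ours.

```python
import numpy as np
import tensorflow as tf

def score_cam(model, act_model, image, class_idx):
    """Gradient-free Score-CAM; one forward pass per channel, so it is slow but inference-only."""
    x = image[np.newaxis, ...].astype("float32")
    acts = act_model(x).numpy()[0]                                   # A_l, shape (h, w, K)
    h, w = image.shape[:2]
    baseline_score = float(model(np.zeros_like(x))[0, class_idx])    # f^c(x_b) with x_b = 0
    cic = []
    for k in range(acts.shape[-1]):
        up = tf.image.resize(acts[..., k:k + 1], (h, w)).numpy()     # Up(A_l^k)
        lo, hi = float(up.min()), float(up.max())
        mask = (up - lo) / (hi - lo) if hi > lo else np.zeros_like(up)   # s(.): min-max to [0, 1]
        score = float(model(x * mask[np.newaxis, ...])[0, class_idx])    # f^c(x ∘ H_l^k)
        cic.append(score - baseline_score)                           # formula (6)
    cam = np.maximum(np.tensordot(acts, np.array(cic), axes=([2], [0])), 0.0)  # formula (8)
    return cam                                                       # at the layer's spatial resolution
```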

In the experiments conducted in [26], Score-CAM performs better than Grad-CAM [22] and Grad-CAM++ [23] in both heat-map visualization and quantitative evaluation. The visualized results of these two CAM methods are depicted in Figure 1. Figures 1(c) and 1(e), respectively, show the results of Grad-CAM and Score-CAM. Score-CAM can always highlight the main objects and suppress the noise in the background area, while Grad-CAM obtains inaccurately activated heat maps on most occasions. Combining the above analysis, we believe that Score-CAM can play a better role than gradient-based methods in adversarial defense.

3.2. Detection Framework Design
3.2.1. Detecting the Adversarial Examples

At this point, we have obtained the activation map containing the attention information of the model. The next question is how to use this information to single out the malicious examples.

Our approach is to denoise the adversarial perturbations with decorated noise. We depict the detection framework in Figure 2.

The computation process of the denoised image $\hat{x}$ from an original image $x$ can be formulated as

$$\hat{x} = \mathrm{Clip}_{[0,255]}\big(x + N \circ L_{\mathrm{Score\text{-}CAM}}^{c}(x)\big). \qquad (9)$$

First, we generate a noise matrix $N$ with the same shape as $x$. Then, we compute the weighted noise by dot-multiplying the noise matrix and the Score-CAM of $x$ w.r.t. the class of interest $c$, i.e., $N \circ L_{\mathrm{Score\text{-}CAM}}^{c}(x)$. In this paper, we adopt the class with the highest predicted probability as the class of interest, and the Score-CAM is by default resized to the same shape as the input image and the noise matrix. Last, the weighted noise and the original image are added to generate a new image called the edited image $\hat{x}$. Here, we directly clip the pixel values to the range [0, 255]. We utilize random Gaussian noise with zero mean and an adjustable standard deviation $\sigma$.
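A minimal sketch of the editing step in formula (9) is shown below. The default `sigma` value is arbitrary, and the extra normalization of the CAM to [0, 1] is a safety assumption on top of the description above.

```python
import numpy as np
import tensorflow as tf

def edit_with_weighted_noise(image, cam, sigma=16.0, seed=None):
    """Superimpose zero-mean Gaussian noise dot-multiplied by the resized Score-CAM."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    cam = tf.image.resize(cam[..., None], (h, w)).numpy()       # resize the CAM to the input shape
    cam = cam / (cam.max() + 1e-8)                               # keep the spatial weights in [0, 1]
    noise = rng.normal(loc=0.0, scale=sigma, size=image.shape)   # noise matrix N
    edited = image + noise * cam                                 # formula (9): x + N ∘ Score-CAM
    return np.clip(edited, 0.0, 255.0)                           # trim pixel values beyond [0, 255]
```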

This method introduces randomization to the defense side to lower the possibility of being bypassed by targeted malicious attacks. Besides, it does not shift the mean value of the original pixels’ distribution and does not severely degrade the prediction accuracy of the models.

The last part of the detection framework is the mechanism of result determination. Based on the discussion in Section 3.1, noise of a limited level only weakly affects the recognition of a benign example. On the contrary, weak noise can lead to the failure of adversarial perturbations since they are designed to be as unperceivable and tiny as possible. The prediction results of the original image $x$ and the edited image $\hat{x}$ are compared to judge whether $x$ is adversarial. If the two images correspond to different prediction labels, the original image is determined to be an adversarial example. On the contrary, consistent predictions for the two images indicate a clean example.
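Putting the pieces together, the following sketch implements the decision rule by reusing the `score_cam` and `edit_with_weighted_noise` helpers sketched above; all function names are ours.

```python
import numpy as np

def is_adversarial(model, act_model, image, sigma=16.0):
    """Return True when the prediction changes after CAM-weighted noising."""
    pred = model(image[np.newaxis, ...].astype("float32"))
    cls = int(np.argmax(pred))                                   # class of interest: top-1 prediction
    cam = score_cam(model, act_model, image, cls)                # gradient-free attention map
    edited = edit_with_weighted_noise(image, cam, sigma=sigma)
    pred_edited = model(edited[np.newaxis, ...].astype("float32"))
    return int(np.argmax(pred_edited)) != cls                    # inconsistency -> adversarial
```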

3.2.2. Setup for Score-CAM

Different attack algorithms behave differently in altering input pictures and manipulating neurons. Some adversarial algorithms, such as $L_{\infty}$ attacks, limit the magnitude of the changed pixels rather than the number of pixels. Malicious examples of this kind tend to activate large numbers of neurons abnormally with respect to the actual labels. By accumulating a considerable amount of tiny deviations, a qualitative change happens and the prediction label changes. On the contrary, $L_{0}$ attacks limit the number of pixels modified. They tend to exploit a few amplifier paths and lead to a decisive change in deeper layers. Most attacks exploit both aforementioned ways, such as the $L_{2}$ attack, which constrains the total change using the Euclidean distance to produce more unperceivable perturbations. For both kinds of attacks, the shallow layers in the target model often do not accumulate significant adversarial disturbances. The drastic changes may occur in deeper layers. Therefore, shallow layers are not ideal targets for extracting CAM in our work.

In the process of calculating Score-CAM, activation maps are upsampled to the same spatial size as the input image. After that, the resized activation maps are used as masks on the input image. However, it is a "first-line therapy" to reduce the spatial size along the inference direction when designing a convolutional network. For example, the size of the activation maps in the last convolution layer of ResNet50 is 7×7 when the input image has a size of 224×224. According to formula (8), the output size of Score-CAM depends on the spatial size of the $l$-th layer's activation map. So, the size of the Score-CAM is usually smaller than that of the input image. However, the Score-CAM will be resized to the shape of the input image according to formula (9) by using an interpolation algorithm (nearest-neighbor interpolation in our implementation). Therefore, the spatial information is too coarse if we extract the Score-CAM from very deep layers.

After the above discussion, we can conclude that the layers in the middle of a model are most appropriate for our framework. The specific layer names and the size of the activation maps are listed in Table 1. We also conduct an ablation experiment to verify our inference in Section 4.3.

4. Evaluation

In this section, we conduct experiments to evaluate the effectiveness of the proposed detection framework.

4.1. Implementation Details
4.1.1. Dataset and Models

We conduct experiments on ILSVRC2012 samples from ImageNet [34], one of the most representative colored image datasets for computer vision tasks. Several prevalent DNNs are chosen as the target models, including ResNet50, ResNet101, DenseNet201, Xception, and InceptionV3. They are currently among the most prevailing architectures and are used as backbone networks in all kinds of computer vision tasks, such as face recognition, semantic segmentation, and object detection. The pretrained model weights and preprocessing API come from Keras. Eight-bit images are converted to floating-point matrices. Afterward, the alterations in our experiments are conducted directly on these matrices, with values restricted to the range from 0 to 255.

4.1.2. Attack Setup

The adversarial examples are generated from the ILSVRC2012-val images that are correctly classified by the target models. For each target model and each attack algorithm, we select 500 successful adversarial examples and the corresponding original images as the test data.

In other words, every detection experiment is conducted on a test set containing 500 benign examples and 500 corresponding adversarial examples.

Our experiments are conducted with several representative white-box adversarial attack algorithms: FGSM [3], BIM [36], PGD [10], and CW attacks [6]. Attacks with different norms are also taken into consideration. In this paper, we adopt untargeted attacks for our experiments.

Keeping the disturbance level low is one of the design targets of adversarial examples. To avoid generating coarse adversarial examples, we tune the hyperparameters of the attacks carefully and keep the attack success rate around 90%. We adopt the implementations from the ART library [31]. The details of the attacks and target models are listed in Table 1.

4.2. Adversarial Example Detection

Table 2 shows the experimental results on adversarial example detection of the proposed method (weighted noise (WN) with Score-CAM, written as Score-CAM+WN for abbreviation) versus the state-of-the-art method (emphasized image (EI) with Grad-CAM, written as Grad-CAM+EI for abbreviation) proposed by Ye et al. [28]. For the completeness of the experiment, we also introduce two other ablation settings, i.e., Score-CAM+EI and Grad-CAM+WN. The hyperparameter $\sigma$ denotes the standard deviation of the Gaussian white noise employed only in WN, and $\lambda$ is the proportion of the CAM superimposed onto an image, used only in EI.

For all the above methods, editing the input examples disturbs the original pixel distributions, leading to accuracy degradation on the original benign examples. This accuracy is called the Original Samples Prediction Accuracy (OSPA). In our experiments, OSPA is 100% when the hyperparameter $\sigma$ or $\lambda$ equals zero, because the chosen examples are not edited at this time and all of them can be correctly classified. OSPA decreases as $\sigma$ or $\lambda$ increases. To fairly compare the effectiveness of the different approaches, OSPA is adjusted to 90% (±0.5%) for each experiment by tuning $\sigma$ or $\lambda$ of the corresponding method.
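As an illustration of this calibration protocol (not the exact procedure used in our experiments), $\sigma$ can be swept over a grid until the OSPA on a set of benign examples falls within the target band; the grid values below are arbitrary, and the helpers come from the earlier sketches.

```python
import numpy as np

def ospa(model, act_model, benign_images, labels, sigma):
    """Original Samples Prediction Accuracy: accuracy on benign examples after editing."""
    correct = 0
    for img, lab in zip(benign_images, labels):
        cls = int(np.argmax(model(img[np.newaxis, ...].astype("float32"))))
        cam = score_cam(model, act_model, img, cls)
        edited = edit_with_weighted_noise(img, cam, sigma=sigma)
        pred = model(edited[np.newaxis, ...].astype("float32"))
        correct += int(int(np.argmax(pred)) == lab)
    return correct / len(benign_images)

def calibrate_sigma(model, act_model, benign_images, labels,
                    target=0.90, tol=0.005, grid=np.linspace(2.0, 64.0, 32)):
    """Return the largest sigma whose OSPA stays within target ± tol (None if none qualifies)."""
    chosen = None
    for sigma in grid:
        if abs(ospa(model, act_model, benign_images, labels, sigma) - target) <= tol:
            chosen = float(sigma)
    return chosen
```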

The experiments consist of 30 groups: 6 attacks × 5 models. The left-most column shows the attack name and its norm type; for example, CW ($L_2$) indicates the Carlini and Wagner attack with the $L_2$ norm. The top row shows the names of the five target models. We report three values for each experiment: the hyperparameter, the adversarial example accuracy (adversaries' accuracy), and the detection success rate (success rate). The detection success rate is the percentage of the 500 adversarial examples whose prediction labels differ before and after being edited. The adversarial example accuracy is the prediction accuracy of the adversarial examples after being denoised.

Since random noise is introduced into the detection framework, the results differ from run to run. Therefore, 10-fold testing is applied to the WN method: for each experiment introducing random noise, the final result is the average over 10 repetitions.

As shown in Table 2, except for the FGSM attack, the detection success rate of the proposed method reaches more than 60% in most cases. When facing FGSM attacks, there is a drop in the success rate. We believe this is because the FGSM attack is relatively coarse: to maintain the attack success rate of 90% in our experiments, a larger step size is adopted and higher-level distortion is added. The decorated weighted noise with the same $\sigma$ or $\lambda$ could not decompose such stronger adversarial perturbations.

A very noticeable point is that the proposed method (Score-CAM+WN) achieves a higher success rate than the baseline (Grad-CAM+EI) in almost all cases. Even in the experiments where the proposed method performs worse (e.g., the CW attack on Xception), its gap to the best result is insignificant. This proves that the proposed method is more sensitive to adversarial examples. Comparing the results with the same superimposing method but different CAM types, Score-CAM always performs better; comparing the results with the same CAM type but different superimposing methods, WN always performs better. The data on adversarial example accuracy shows a similar pattern.

Considering that no training is carried out before deployment, the proposed method achieves quite impressive results. Furthermore, it works for different attacks and various models, demonstrating its generality.

4.3. Choice of Layer

In this section, we validate the analysis and discussion about different layers in Section 3.2.2. Activations from different layers are utilized to generate Score-CAM, and the images are then edited by WN. ResNet101 is chosen as the target model. The adversarial examples are produced by BIM and PGD, as described in Table 1. Five layers are picked out for this evaluation. Each layer is the output of the last bottleneck block among those with the same output shape. For example, there are four bottleneck blocks with an output shape of 28×28×512, conv3_block1 to conv3_block4, and conv3_block4 is the last one.

As shown in Table 3, conv3_block4_out with the output shape of 28×28×512 performs best. The defense results rise first and then descend along with the reduction of the spatial size. It is fully in line with our previous analysis in Section 3.2.2.

Another noticeable phenomenon is that $\sigma$ generally descends with shrinking spatial size. Since we keep the OSPA at 90% for all experiments, this indicates that a lower noise level is needed to maintain OSPA when using Score-CAM with a smaller spatial size.

4.4. The Trade-Off between OSPA and Success Rate

In this section, we examine the relationship between OSPA and the detection success rate. The adversarial examples are still produced on ResNet101 by BIM and PGD. The attack configuration is the same as described in Table 1. As shown in Table 4, our method reaches more than a 42% success rate at an OSPA of 90% for both BIM and PGD attacks. The success rate improves along with the decrease of OSPA. However, the accuracy of the adversarial examples first increases and then decreases. Decorated noise added to contaminated images can mitigate the adverse effects of adversarial perturbations; hence, the adversaries' accuracy increases at first. However, a DNN can only filter out random noise within a certain limit. When the noise power is too large, the original semantic information is wrecked, which leads to a drop in the adversaries' accuracy.

5. Conclusion

In this paper, we propose a gradient-independent adversarial example detection framework based on deep learning interpretability techniques. Based on our analysis, we conclude that adversarial examples are sensitive to random noise while clean ones are not, and we cover the perturbations with decorated random noise by taking advantage of this property. The random noise is decorated based on the example-wise Score-CAM to emphasize the area the target model really focuses on and to eliminate unnecessary accuracy loss. Extensive experimental results show that the proposed framework can always achieve the highest prediction accuracy and detection success rate compared with previous works. We further conduct ablation experiments to explore the impact of Score-CAM extracted from different layers and find that the middle layers of models are most suitable for extracting Score-CAM. In addition, we also investigate the trade-off between clean data accuracy and detection success rate. We believe that our framework can be easily updated when more accurate and efficient saliency map methods emerge.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 62173066).