Abstract

Deep neural networks (DNNs) have been likened to Pandora's box since their inception. Although they achieve remarkably high accuracy on real-world tasks (e.g., object detection and speech recognition), they retain fatal vulnerabilities and flaws. A malicious attacker can force a DNN model to misclassify simply by adding tiny perturbations to the original image; such crafted samples are called adversarial examples. One effective defense is to detect them before they are fed into the model. In this paper, we delve into the representation of adversarial examples in the original spatial and spectral domains. Qualitative and quantitative analyses confirm that the high-level representations and high-frequency components of abnormal samples contain richer discriminative information. To further explore how the two factors interact, we perform an ablation study, and the results show a win-win effect. Building on this finding, we propose a detection method (HLFD) based on extracting high-level representations and high-frequency components. In a series of experiments conducted on MNIST, CIFAR-10, CIFAR-100, SVHN, and Tiny-ImageNet, our method achieves better detection performance than other state-of-the-art detection methods in most scenarios. In particular, we improve detection rates by a large margin on the DeepFool and CW attacks.

1. Introduction

Exploring the inherent patterns and trends of data is an eternal subject for humanity. Traditional data processing is driven by rules obtained through experience or manual summarization. The advent of DNNs makes it possible to process data automatically. Thanks to this characteristic, DNNs have recently been widely applied in data-sensitive fields such as financial payment [1], medical assistance [2], and satellite remote sensing [3]. On the other hand, the benefit of automation also brings the black-box property [4-6], meaning that almost everything inside the network is unknown to us. Against this background, Goodfellow et al. found in [7] that adversarial examples, imperceptible to a human observer and produced by adding crafted perturbations to benign images, can cause a DNN model to misclassify with high confidence. There is no doubt that defending against these out-of-distribution samples has become a top priority.

Before implementing adversarial defenses, it is crucial to explore the intrinsic properties of adversarial examples. At present, there are two main explanations for their existence: low-probability regions in the data manifold [8] and the linearity hypothesis [7]. These studies focus on the spatial probability distribution (i.e., statistical probability) to justify their arguments. However, recent studies [9-15] indicate that adversarial perturbations are mainly concentrated in the high-frequency region. Moreover, Ilyas et al. further illustrated in [16] that adversarial examples are not bugs but non-robust features, which means a DNN model can learn them through training.

Inspired by these studies, we set out to explore how the representations of adversarial examples differ between the original spatial domain and the spectral domain. Here, the spatial and spectral domains refer to the original samples and the original samples after the Fourier transform, respectively. Intuitively, we observe that adversarial examples exhibit many small black dots in the mid-high frequency region. Through cluster analysis, we further find that adversarial examples can be separated more easily in the spectral domain. To promote detection performance, we then analyze how the representations of different network layers and different frequency bands affect the detection task. The experimental results demonstrate, surprisingly, that extracting high-level representations and high-frequency components improves detection performance significantly.

In this paper, we propose the HLFD detection method, which detects abnormal samples by extracting high-level representations and high-frequency components. Taking the model's high-level feature maps as input, we transform them into the spectral domain via the Fourier transform and extract the high-frequency components. Training on these transformed data yields a detector with strong performance. Compared with other defense methods, HLFD needs no change to the network architecture and incurs a lower computational cost, which are its main advantages. An overview of the detection model is shown in Figure 1.

We evaluate our method on six attacks: the fast gradient sign method (FGSM) [7]; two of its variants, the basic iterative method (BIM) [17] and projected gradient descent (PGD) [18]; the Jacobian-based Saliency Map Attack (JSMA) [19]; Carlini and Wagner (CW) [20]; and DeepFool [21]. Using only one of the two tricks, high-frequency extraction or high-level representation, cannot achieve ideal detection performance on DeepFool and CW, although either performs well against the FGSM, BIM, PGD, and JSMA attacks. Considering that high-level representations and high-frequency components may interact, we further perform an ablation study on the two factors. The experimental results show a win-win effect, i.e., better performance when the two tricks are applied together; the DeepFool and CW attacks can be detected efficiently by employing both factors simultaneously. For a more rigorous conclusion, the detector is evaluated on five datasets: MNIST, CIFAR-10, CIFAR-100, SVHN, and Tiny-ImageNet (aka T-ImageNet).

For a fair comparison, adversarial examples are restricted to similar norm values, and we employ three state-of-the-art detectors as baselines: kernel density and Bayesian uncertainty (KD + BU) [22], local intrinsic dimensionality (LID) [23], and Mahalanobis distance (M-D) [24]. Our method outperforms the other three detectors for every attack on CIFAR-10 and CIFAR-100. Although our detection rate is not the highest in some scenarios, the gap is at least small. Moreover, we improve the detection rates by a large margin on the DeepFool and CW attacks.

In particular, our main contributions are as follows:
(i) We intuitively and experimentally show that spectral-domain samples contain richer discriminative details, which can be effectively distinguished by the detector.
(ii) The ablation study shows that detection performance can be improved effectively by either high-level representation or high-frequency extraction.
(iii) We propose an effective method for detecting adversarial examples that performs better than the state of the art in most scenarios.

2. Related Work

In this section, we briefly introduce several state-of-the-art methods for adversarial attacks and adversarial defenses.

2.1. Adversarial Attack
2.1.1. Fast Gradient Sign Method (FGSM) [7]

Goodfellow et al. proposed the fast gradient sign method (FGSM), one of the simplest gradient-based ways to generate adversarial examples: one only needs to compute the direction of the gradient of the loss function. The method can be expressed as

$$X_{adv} = X + \epsilon \cdot \operatorname{sign}\left(\nabla_X J(X, Y)\right), \tag{1}$$

where $J$ is the loss function on a given image $X$ with label $Y$, and $\epsilon$ controls the perturbation size.
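For concreteness, the following is a minimal PyTorch sketch of (1); the classifier interface and the [0, 1] pixel range are our assumptions, not the original code.

import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    # One-step FGSM from (1): x_adv = x + eps * sign(grad_x J(x, y)).
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in the valid range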

2.1.2. Basic Iterative Method (BIM) [17]

In most cases, adversarial samples generated by a single FGSM step are ineffective. As a variant of FGSM, BIM performs multiple gradient computations along the direction of the loss function and can be represented as follows:

$$X_0^{adv} = X, \qquad X_{N+1}^{adv} = \operatorname{Clip}_{X,\epsilon}\left\{ X_N^{adv} + \alpha \cdot \operatorname{sign}\left(\nabla_X J(X_N^{adv}, Y)\right) \right\}, \tag{2}$$

where $N$ and $\alpha$ refer to the number of iterations and the step size of each iteration, respectively, and $J$ denotes the cross-entropy loss function on a given image $X$ and label $Y$. The norm of the accumulated perturbation is limited to $\epsilon$ by the clipping function $\operatorname{Clip}_{X,\epsilon}$.

2.1.3. Projected Gradient Descent (PGD) [18]

PGD is an advancement of BIM that initializes $X_0^{adv}$ with uniform random noise inside the $\epsilon$-ball before iterating.
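A minimal PyTorch sketch of the iteration in (2) follows; setting random_start=True gives the PGD variant. The step sizes, projection onto the L-infinity ball, and [0, 1] range are illustrative assumptions.

import torch
import torch.nn.functional as F

def bim_pgd(model, x, y, eps, alpha, steps, random_start=False):
    # Iterative FGSM as in (2); random_start=True turns BIM into PGD.
    x_adv = x.clone().detach()
    if random_start:  # PGD: start from a uniformly random point in the eps-ball
        x_adv = x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Clip_{X,eps}: project back onto the eps-ball around x, then to [0, 1]
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()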

2.1.4. Jacobian-Based Saliency Map Attack (JSMA) [19]

Utilizing the Jacobian matrix, Papernot et al. put forward the Jacobian-based Saliency Map Attack (JSMA). It backpropagates from the class probabilities of the last layer to obtain the corresponding gradient information:

$$J_F(X)[t, i] = \frac{\partial F_t(X)}{\partial X_i}, \tag{3}$$

where $t$ is the class to be attacked. By constructing a saliency map $S(X, t)$ from this Jacobian, the pixels that contribute most to the result can be found, and adversarial examples can be generated by exploiting this information.

2.1.5. Carlini and Wagner (CW) [20]

CW casts the adversarial attack as an optimization problem: its key idea is to keep seeking the smallest adversarial perturbation while guaranteeing that the model misclassifies. The targeted $L_2$ variant can be formulated as follows:

$$\min_{w} \left\| \tfrac{1}{2}(\tanh(w) + 1) - X \right\|_2^2 + c \cdot f\!\left(\tfrac{1}{2}(\tanh(w) + 1)\right), \qquad f(x') = \max\left( \max_{i \neq t} Z(x')_i - Z(x')_t,\; -\kappa \right), \tag{4}$$

where $Z$ is the output of the pre-softmax layer. Utilizing the $\tanh$ function maps $X_{adv}$ into $\tanh$ space, thereby avoiding the loss caused by truncation to the valid pixel range.
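As a small sketch of the margin term $f$ in (4) under our reading of the targeted formulation (the batch layout is an assumption):

import torch

def cw_margin(logits, target, kappa=0.0):
    # f(x') = max(max_{i != t} Z(x')_i - Z(x')_t, -kappa) from (4),
    # where `logits` are the pre-softmax outputs Z(x') of shape (batch, classes).
    one_hot = torch.nn.functional.one_hot(target, logits.size(1)).bool()
    target_logit = logits[one_hot]                                  # Z(x')_t
    other_max = logits.masked_fill(one_hot, float('-inf')).max(1).values
    return torch.clamp(other_max - target_logit, min=-kappa)

Minimizing this term drives the target class logit above all others, while the squared $L_2$ term in (4) keeps the perturbation small.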

2.1.6. DeepFool [21]

Compared with other attack methods, DeepFool is known for its minimal perturbations, which makes its adversarial examples harder to detect. It stops searching as soon as the sample just crosses the decision boundary; for an affine binary classifier $f(X) = w^\top X + b$, this is described by the formula below:

$$r_*(X) = -\frac{f(X)}{\|w\|_2^2} \, w, \tag{5}$$

where $r_*(X)$ is exactly the smallest perturbation.

We apply the Python toolbox Foolbox [25] to generate adversarial examples for all attack methods. An intuitive comparison of the various attack methods is presented in Figure 2.
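A minimal sketch of how such examples can be generated, assuming the Foolbox 3.x API; the toy model, input shape, and epsilon below are illustrative placeholders, not the paper's settings.

import torch
import foolbox as fb

net = torch.nn.Sequential(torch.nn.Flatten(),
                          torch.nn.Linear(3 * 32 * 32, 10)).eval()  # toy stand-in
fmodel = fb.PyTorchModel(net, bounds=(0, 1))

images = torch.rand(8, 3, 32, 32)           # stand-in for CIFAR-10 test images
labels = torch.randint(0, 10, (8,))

attack = fb.attacks.LinfPGD()               # or FGSM(), L2CarliniWagnerAttack(), ...
raw, clipped, success = attack(fmodel, images, labels, epsilons=0.03)
adv = clipped[success]                      # keep only samples that fool the model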

2.2. Adversarial Defense and Detection

In general, adversarial defenses can be divided into two main categories. One strategy is to modify the architecture or parameters of the network; the other is to defend by preprocessing images before feeding them into the model. Adversarial training [26, 27], which belongs to the first strategy, has achieved great success in the defensive area: the model is retrained on normal and abnormal samples to learn the details of the decision boundary, thereby avoiding misclassification and gaining stronger robustness. Nonetheless, its need for massive data and its ineffectiveness against specialized attacks have made it less attractive. Soon afterwards, defensive distillation [28] was proposed; although experiments conducted on small datasets showed that it can defend against adversarial examples effectively, it is limited to DNN models that output probability distribution vectors. Methods belonging to the second category, such as JPG image compression [29], rejecting classification [30], and detection [22-24, 31-33], try to eliminate abnormal statistical characteristics before they poison the model. Defense methods that do not touch the training process are undoubtedly more appealing.

As one approach in the defensive field, adversarial detection has attracted scholars' attention due to its higher flexibility and lower computational cost. Sample statistics and training a detector are the two main routes. Exploiting statistical properties, Feinman et al. delved into the kernel density (KD) and Bayesian uncertainty (BU) of the model's hidden layers and proposed an effective detection method in [22]. Ma et al. further applied local intrinsic dimensionality (LID) in [23] to describe the intrinsic characteristics of adversarial subspaces. Considering that the information of the last layer may not suffice to judge out-of-distribution data, Lee et al. [24] made full use of every layer of the DNN and obtained a detector by calculating the Mahalanobis distance (M-D). Hendrycks and Gimpel indicated in [34] that samples with a large principal component are more likely to attack successfully. Meanwhile, recent research showed that DNNs are sensitive to the directions of Fourier basis functions: in [9, 10], it was found that the high-frequency components of adversarial examples seriously affect the robustness of the model. Under the assumption that each layer of a DNN obeys a generalized Gaussian distribution, Ma et al. [35] calculated the Benford-Fourier coefficients of each layer and thereby obtained a support vector machine with ideal detection performance.

3. Methodology

In this section, we introduce in detail the mechanism by which our method identifies adversarial examples.

3.1. Threat Model

Existing studies continually explore how to generate adversarial examples. Fortunately, they can be summarized into two main goals: make the model misclassify, and keep the perturbation as small as possible. Suppose $X \in \mathbb{R}^n$ is an $n$-dimensional input image, $\delta$ is a perturbation of $X$, and $F$ is a model trained on $X$. If $F(X + \delta) \neq F(X)$, we define $X_{adv} = X + \delta$ as an adversarial example specific to the model $F$ and the normal image $X$. However, if the perturbation is so large that neither the model nor the human eye can identify the image correctly, the adversarial example loses its practical significance. Hence, our objective can be expressed as

$$\min \|\delta\|_p \quad \text{s.t.} \quad F(X + \delta) \neq F(X), \qquad \|\delta\|_p = \left( \sum_{i=1}^{n} |\delta_i|^p \right)^{1/p}, \tag{6}$$

where $\delta_i$ denotes the $i$-th dimensional value of $\delta$ and $\|\delta\|_p$ is the $L_p$ norm of $\delta$. In general, an $L_p$ norm is utilized to limit the growth of the perturbation; in this paper, we apply the $L_2$ norm for all adversarial attacks.

For the detection task, let $D$ be a detector, which is essentially a binary classifier. An ideal detector classifies normal images as label 0 (i.e., $D(X) = 0$) and abnormal images as label 1 (i.e., $D(X_{adv}) = 1$). We formulate this objective as follows:

$$\max_{D} \frac{1}{2N} \sum_{i=1}^{N} \left[ \mathbb{1}\left(D(X_i) = 0\right) + \mathbb{1}\left(D(X_i^{adv}) = 1\right) \right], \tag{7}$$

where $N$ refers to the number of samples and $D(\cdot)$ represents the result of feeding a sample into the detector $D$. The indicator $\mathbb{1}(\cdot)$ returns 1 if its condition holds and 0 otherwise. The $D$ that maximizes (7) is exactly the detector we seek, and the maximized value is employed as one of our evaluation metrics, also called the detection accuracy.
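For concreteness, a small sketch of how the detection accuracy in (7) can be computed empirically; the scikit-learn-style predict interface is an assumption on our part.

import numpy as np

def detection_accuracy(detector, X_normal, X_adv):
    # Empirical form of (7): fraction of samples labeled correctly,
    # label 0 for normal and label 1 for adversarial.
    pred_n = detector.predict(X_normal)  # ideally all 0
    pred_a = detector.predict(X_adv)     # ideally all 1
    correct = np.sum(pred_n == 0) + np.sum(pred_a == 1)
    return correct / (len(pred_n) + len(pred_a))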

3.2. High-Level Representation

As shown in (7), our objective is to make the detector identify as many samples as possible. In general, there are two strategies to achieve this: transform the input data, or alter the internal structure of the detector. The subsequent experiments shown in Figure 8 reveal that altering the detector model does not improve detection performance significantly. Hence, feature engineering on the input data becomes our principal subject. Surprisingly, a simple attempt to extract high-level representations from the raw data breaks through the technical difficulty. As Harder et al. illustrated in [11], high-level representations provide more stable and robust discriminative details for adversarial detection.

Suppose $M$ is a DNN model trained on $X$; we can obtain the $m$-th feature map simply by computing $M_m(X)$. Using $M_m(X)$ instead of $X$ as the input data, (7) attains a higher value, i.e., better detection accuracy, under the same training time.
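One practical way to obtain $M_m(X)$ from a pretrained PyTorch model is a forward hook, sketched below; the choice of layer (e.g., model.layer4 in a torchvision ResNet) is an assumption about the model's structure.

import torch

def feature_map(model, layer, x):
    # Return M_m(x): the activation of the chosen intermediate layer.
    captured = {}
    def hook(module, inputs, output):
        captured['feat'] = output.detach()
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(x)  # forward pass; the hook captures the layer's output
    handle.remove()
    return captured['feat']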

3.3. Fourier Transform and High-Frequency Extraction
3.3.1. Fourier Transform

In general, the Fourier transform converts signals between the time domain (or spatial domain) and the frequency domain. After converting data into a spectrum, many characteristics hidden in the spatial domain are revealed: the low-frequency components correspond to slowly changing regions (i.e., flat regions), while the high-frequency components correspond to the opposite (i.e., edges or noise). Exploiting these properties, we can obtain blurred or edge-sharpened images by suppressing the high- or low-frequency components, respectively. Unlike continuous mathematical signals, however, images are discrete data consisting of pixels, so we must use the discrete Fourier transform (DFT). For a low computational cost, we employ the fast Fourier transform (FFT) [36], which has a time complexity of $O(n \log n)$. Suppose an image $I \in \mathbb{R}^{W \times H}$, where $W$ and $H$ represent the width and height of the image, respectively. We acquire the Fourier coefficients by the following formula:

$$F(u, v) = \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} I(x, y) \, e^{-i 2\pi \left( \frac{ux}{W} + \frac{vy}{H} \right)}, \tag{8}$$

where $0 \le u < W$, $0 \le v < H$, and $I(x, y)$ refers to the pixel value at coordinate $(x, y)$. $F$ is actually a complex matrix with the same size as the image $I$. The magnitude matrix is acquired via the following formula:

$$|F(u, v)| = \sqrt{\operatorname{Re}(F(u, v))^2 + \operatorname{Im}(F(u, v))^2}, \tag{9}$$

where $\operatorname{Re}$ and $\operatorname{Im}$ refer to the real and imaginary parts, respectively. In subsequent experiments, the magnitude of the spectrum is applied to represent the spectral domain.
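Equations (8) and (9) map directly onto NumPy's FFT routines, as the short sketch below shows; note that without an fftshift the zero-frequency term sits at the corner of the array.

import numpy as np

def magnitude_spectrum(img):
    # Compute |F(u, v)| of (8)-(9) for a 2-D grayscale image.
    # np.fft.fft2 implements the DFT with FFT cost O(WH log(WH)).
    F = np.fft.fft2(img)                       # complex matrix, same size as img
    return np.sqrt(F.real ** 2 + F.imag ** 2)  # equivalently np.abs(F)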

3.3.2. High-Frequency Extraction

Due to the conjugate symmetry of the FFT, the effective spectrum of a two-dimensional image accounts for only a quarter of the full spectrum. We divide the effective spectrum into four parts (a), (b), (c), and (d), as shown in Figure 3, corresponding to the low, medium-low, medium-high, and high-frequency bands, respectively. For a fair division, it is necessary to ensure that each part occupies 25% of the pixels of the image. Hence, we introduce a threshold function that separates the frequency components according to a radius $r$. Suppose the effective spectrum is $E$ with lower-left corner $(0, H'-1)$; the formal definition of the threshold function is

$$T_r(u, v) = \begin{cases} 1, & d\big((u, v),\, (0, H'-1)\big) \le r, \\ 0, & \text{otherwise}, \end{cases} \tag{10}$$

where $E(u, v)$ represents the effective spectrum at position $(u, v)$, the point $(0, H'-1)$ is exactly the lower left of the effective spectrum, and $d$ refers to the Euclidean distance. By calculating the following, we can obtain each frequency band simply:

$$B_{r_{k-1}, r_k}(u, v) = \big( T_{r_k}(u, v) - T_{r_{k-1}}(u, v) \big) \cdot E(u, v), \quad k = 1, 2, 3, 4, \tag{11}$$

where $r_0 = 0$ and $r_1 < r_2 < r_3 < r_4$ are the boundary values that quarter the matrix $E$. Hence, the low-frequency component can be obtained by calculating $B_{r_0, r_1}$; by analogy, the medium-low, medium-high, and high-frequency components can be obtained by computing $B_{r_1, r_2}$, $B_{r_2, r_3}$, and $B_{r_3, r_4}$, respectively.
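A NumPy sketch of (10)-(11) under our reconstruction of the band geometry (distance measured from the lower-left corner of the effective spectrum); the way the radii are chosen to give equal-area bands is our assumption about how the quartering is realized.

import numpy as np

def frequency_band(E, r_lo, r_hi):
    # Band-pass of (11): keep entries of the effective spectrum E whose
    # Euclidean distance from the lower-left corner lies in (r_lo, r_hi].
    h, w = E.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    d = np.sqrt(u ** 2 + (h - 1 - v) ** 2)  # distance from lower-left (0, h-1)
    mask = (d > r_lo) & (d <= r_hi)
    return E * mask

def quartile_radii(shape):
    # Radii r_1..r_4 such that each band covers roughly 25% of the pixels.
    h, w = shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    d = np.sort(np.sqrt(u ** 2 + (h - 1 - v) ** 2).ravel())
    return [d[int(q * (d.size - 1))] for q in (0.25, 0.5, 0.75, 1.0)]

With radii r1..r4 from quartile_radii, the high-frequency component of (11) is frequency_band(E, r3, r4), and the mid-high band is frequency_band(E, r2, r3).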

3.4. HLFD Detection Method
Input:
Original samples X
Adversarial samples X_adv
The number of samples n
Selected high-level representation layer m
Model M trained on X
Output:
Detector D
(1) for i = 1 to n do
(2)   # the m-th feature map
(3)   X_L ← M_m(X_i)
(4)   Xadv_L ← M_m(Xadv_i)
(5)   X_LH ← getHighFrequencyComponent(X_L)
(6)   Xadv_LH ← getHighFrequencyComponent(Xadv_L)
(7)   X_LH_list.append(X_LH.flatten())
(8)   Xadv_LH_list.append(Xadv_LH.flatten())
(9) end
(10) Train D with X_LH_list and Xadv_LH_list
(11) return D

As shown in Figure 4, we divide the HLFD detection method into three parts: extracting high-level representations, extracting high-frequency components, and the training process. Given the normal and abnormal samples $X$ and $X_{adv}$, we obtain the $m$-th feature maps of the model $M$ by calculating $M_m(X)$ and $M_m(X_{adv})$. According to the experimental results in Section 4.2, further converting $M_m(X)$ and $M_m(X_{adv})$ to the spectral domain brings a greater improvement on detection tasks. Thus, we employ the Fourier transform to acquire the spectral characteristics and further obtain the high-frequency components $X_{LH}$ and $X_{adv,LH}$ by equation (11). As emphasized above, feature engineering is the key to our HLFD method: whether the detector is logistic regression [37], a support vector machine [38], or a neural network model, better detection performance is obtained as long as $X_{LH}$ and $X_{adv,LH}$ are used as input. Figures 5 and 6 illustrate this conclusion intuitively and experimentally, respectively. More specific procedures are given in Algorithm 1.
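The following compact sketch ties Algorithm 1 together, reusing the hypothetical feature_map and frequency_band helpers sketched above; the model, layer, radii, and data are placeholders for the setup of Section 4, and logistic regression stands in for any of the detectors discussed.

import numpy as np
from sklearn.linear_model import LogisticRegression

def hlfd_features(model, layer, X, r_lo, r_hi):
    # Algorithm 1's feature engineering: m-th feature map -> FFT magnitude
    # -> high-frequency band -> flattened vector per sample.
    feats = feature_map(model, layer, X).cpu().numpy()     # (N, C, H, W)
    spec = np.abs(np.fft.fft2(feats, axes=(-2, -1)))       # magnitude spectra
    band = np.stack([frequency_band(s, r_lo, r_hi)
                     for s in spec.reshape(-1, *spec.shape[-2:])])
    return band.reshape(len(X), -1)

# Training sketch: label 0 for normal and 1 for adversarial, as in (7).
# X, X_adv, model, layer, r_lo, and r_hi are placeholders.
Z = np.vstack([hlfd_features(model, layer, X, r_lo, r_hi),
               hlfd_features(model, layer, X_adv, r_lo, r_hi)])
y = np.concatenate([np.zeros(len(X)), np.ones(len(X_adv))])
detector = LogisticRegression(max_iter=1000).fit(Z, y)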

4. Experiment

In this section, we conduct experiments to rigorously demonstrate the effectiveness of our detection method. We begin with the basic experimental setup and explore the discrepancy between the spatial and spectral domains. To improve detection performance, we further explore the impact of the representations of different layers, different frequency bands, and different detectors on the detection task. Finally, we conduct an ablation study and compare our method with existing state-of-the-art methods.

4.1. Experimental Setup

For a more general conclusion, we conduct a series of experiments on five datasets (MNIST, CIFAR-10, CIFAR-100, SVHN, and T-ImageNet) and six attack methods (FGSM, BIM, PGD, JSMA, CW, and DeepFool). For convenience, we employ open-source pretrained models, which achieve 98.4%, 93.7%, 74.2%, 96.0%, and 51.2% accuracy on MNIST, CIFAR-10, CIFAR-100, SVHN, and T-ImageNet, respectively. To obtain adversarial examples, we only keep samples that attack the pretrained models successfully. On average, 10,000 adversarial examples are generated for each dataset, giving 20,000 samples in total after adding the same number of normal samples. We split them into a training set (64%), a validation set (16%), and a test set (20%), and apply the detection rate ACC (accuracy) and AUC (area under the curve) as evaluation metrics. All adversarial examples are generated with the Python toolbox Foolbox [25]. For a fair comparison, each pixel is changed by an average of 10%: for MNIST, $\epsilon = 2.8$, with correspondingly scaled $\epsilon$ values for SVHN, CIFAR-10, CIFAR-100, and T-ImageNet. The $L_2$ norm is used to limit the size of the perturbation in subsequent experiments.

4.2. Spatial Vs. Spectral Domain

The spatial and the spectral domain refer to the original samples and the original samples after the Fourier transform, respectively. To intuitively observe the discrepancy between the two domains, we provide a visual diagram in Figure 5. The pixel distribution of adversarial examples appears discontinuous in the spatial domain, which is caused by the random perturbation. Although humans can recognize the difference between normal and abnormal samples, a machine can hardly learn this pattern since the distribution is neither generalized nor stable. By contrast, adversarial examples in the spectral domain exhibit many small black dots in the mid-high frequency region, which express fixed and generalized patterns that may be effective for training detectors. For a more rigorous conclusion, we further perform cluster analysis in the spatial and spectral domains, as shown in Figure 6. In the first column, normal and abnormal samples cannot be separated by clustering in either the spatial or the spectral domain. Still, it is not hard to see that normal and abnormal samples gradually become separable as the network deepens. Beyond illustrating the effectiveness of high-level representations, we find that data in the spectral domain can be linearly separated, which cannot be achieved in the spatial domain. This phenomenon demonstrates the effectiveness of the spectral domain in a certain sense. To further explore the performance of spatial-domain data on detection tasks, we conduct a series of experiments on CIFAR-10. As shown in Table 1, although spatial data are effective against the FGSM, PGD, BIM, and JSMA attacks, they are powerless against CW and DeepFool. A likely reason is that the perturbations generated by these two attack methods are small and only just cross the decision boundary, which makes them hard to detect in the spatial domain.

4.3. Influence of High-Level Representation

The above experiments qualitatively reveal the effectiveness of high-level representation and high-frequency extraction. Yet it is not clear how high-level representation affects the detection task. We therefore conduct further experiments to explore the impact of the representations of different layers. As shown in Figure 7, the detection rate gradually increases as the network deepens, with occasional decreases. Focusing on CW and DeepFool, which are hard to detect in the spatial domain, we find they can also be detected effectively using high-level representations. Nonetheless, we still cannot be sure which layer works best for the detector; to be safe, we suggest extracting the last two or three layers and aggregating them. An intuitive understanding of why high-level representations work is that these features are incomprehensible (i.e., non-robust) to humans and are extracted as the network deepens. However, both robust and non-robust features are crucial for model training, as Ilyas et al. illustrated in [16]. This is exactly why cluster analysis separates the two classes farther and farther apart as the network deepens, as shown in Figure 6.

4.4. Influence of High-Frequency

Wang et al. illustrated in [9] that high-frequency components can affect model perception and further proposed that high-frequency regions are correlated with the semantic components of images. Inspired by this, we explore the impact of different frequency bands on detection performance. As the experimental results in Tables 2 and 3 show, high-frequency regions can indeed promote the performance of the detector to a certain extent. However, which bands count as high-frequency is an open question: we obtain the highest detection rate with the 3/4 to 4/4 bands (the high-frequency component) in Table 2, whereas the 2/4 to 4/4 bands (the mid-high and high-frequency components) achieve the highest detection rate in Table 3. To be safe, we suggest the 2/4 to 4/4 frequency bands as the output of high-frequency extraction. The high-frequency components correspond to the parts of an image that change drastically, and the perturbation behaves in the same way. This commonality makes the high-frequency components contain more perturbation detail, which is effective for detection. The method for frequency band division is given in equations (10) and (11).

4.5. Influence of Different Detectors

Although the input data are crucial, the choice of the detector model also has an impact on the detection results. We compare three classifier models: LR [37], SVM [38], and a simple neural network. As shown in Figure 8, it is easy to understand that a detector trained on CW works well for detecting the CW attack. However, the detection performance of the three classifiers shows no clear pattern. We therefore believe that altering the model structure is an inefficient way to promote detection performance significantly, which further confirms the rationality of concentrating on feature engineering.

4.6. Ablation Study

To explore the interaction between the representations of different layers and the frequency bands, we perform an ablation study on them. As shown in Figure 9, the low-frequency components of the original images (i.e., the leftmost brown-red bars) are taken as the benchmark detection rate. Two dimensions are considered in the experiment: the layer of the feature map and the interval of the frequency band. To verify the effectiveness of high-level representation and high-frequency extraction, we compare the benchmark results along these two dimensions. For a fair comparison, the norm of each image is controlled around 5.5, and the SVHN dataset is used here. From Figure 9, we observe that the detection rates show an upward trend whether we vary only the network layer or only the frequency band. The results reveal a win-win effect between the representations of different layers and the frequency bands, which further confirms the effectiveness of our HLFD method. Notably, an 83% detection rate is achieved even on DeepFool, which is hard to detect in the spatial domain.

4.7. Comparison with Existing Methods

We compare our method with three state-of-the-art detection methods (KD + BU, LID, and M-D). For a more objective conclusion, we conduct a series of experiments on five datasets and evaluate them on six attacks. For a fair comparison, we set the same perturbation for each attack. As shown in Table 4, our method outperforms the other three detection methods for every attack on CIFAR-10 and CIFAR-100. To make the results more convincing, we also test on a more realistic dataset (T-ImageNet). Although our detection rate is not the highest in some scenarios, the gap is at least small. Moreover, we improve the detection rates by a large margin on the DeepFool and CW attacks. Overall, our HLFD method is more robust and stable in various real-world environments than the existing state-of-the-art methods.

5. Conclusion

In this paper, we propose a simple yet effective method, HLFD, for detecting adversarial examples. By moving from the spatial to the spectral domain, we find that adversarial examples transformed into the spectrum exhibit richer characteristics that benefit detector training. Moreover, we further discover that extracting high-level representations and high-frequency components promotes detection performance, and the ablation study shows a win-win relationship between the two factors. We explain intuitively and experimentally why these two factors work, and exploit these findings to build the HLFD detection method. Although our method outperforms other state-of-the-art adversarial detection methods in most scenarios, detectors still face more complex and unknown attacks in real-world environments. Extending our method to more realistic settings (e.g., the ImageNet dataset) is crucial, and exploring how to detect more aggressive attacks effectively is also a worthwhile research subject.

Data Availability

The public datasets can be downloaded from https://paperswithcode.com/dataset/mnist, https://www.cs.toronto.edu/~kriz/cifar.html and https://paperswithcode.com/dataset/svhn. The pretrained model can be downloaded from https://github.com/aaron-xichen/pytorch-playground.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (No. 2021YFB2700600) and the National Natural Science Foundation of China Enterprise Innovation and Development Joint Fund (No. U19B2044).