Abstract

Although neural networks approach human-level performance on many tasks, they are susceptible to adversarial attacks in the form of small, intentionally designed perturbations that can lead to misclassifications. The best defense against these attacks so far is adversarial training (AT), which improves a model's robustness by augmenting the training data with adversarial examples. However, AT usually decreases the model's accuracy on clean samples and can overfit to a specific attack, inhibiting its ability to generalize to new attacks. In this paper, we investigate the use of domain adaptation to enhance AT's performance. We propose a novel multiple adversarial domain adaptation (MADA) method, which treats this problem as a domain adaptation task to discover robust features. Specifically, we use adversarial learning to learn features that are domain-invariant between multiple adversarial domains and the clean domain. We evaluated MADA on the MNIST and CIFAR-10 datasets with multiple adversarial attacks during training and testing. The results of our experiments show that MADA outperforms AT by about 4% on average on adversarial samples and by about 1% on average on clean samples.

1. Introduction

Machine learning (ML) and deep learning (DL) have achieved remarkable performance in providing intelligent solutions in different domains [1, 2]. Nevertheless, ML and DL systems have shown susceptibility to adversarial attacks in the form of small, purposely crafted perturbations leading to misclassifications, which could render ML models useless, especially in security applications [36]. Moreover, adversarial examples generated for one model can be transferred to attack other models [7]. The field of adversarial machine learning has grown rapidly over the past few years, and many defense methods have been proposed to protect models against adversarial attacks. Among these methods, adversarial training (AT) [8] is the most popular defense; it works by training the model not only on clean samples but on generated adversarial samples as well.

Although AT has been shown to help increase the robustness of deep learning models, it has some drawbacks. More specifically, AT might lead to overfitting on the used attacks, making the model robust only against the seen attacks and failing to generalize against unseen attacks or new methods [9]. Furthermore, AT increases the model’s robustness at the expense of decreasing its accuracy on clean data. To mitigate these drawbacks, our goal is to use methods of domain adaptation to reduce the gap between the adversarial distribution and the clean distribution. We aim at learning robust features that are domain-invariant between the clean domain and the adversarial domain. In this regard, we refer to clean samples as the source domain and adversarial samples as the target domain.

Generally, domain adaptation methods take advantage of the immense development in adversarial learning techniques to learn domain-invariant features by minimizing the statistical distance between the source and target domains [10]. These methods essentially work by aligning the global distributions of the source and target domains, without exploring the underlying complicated multimodal nature of these distributions [11]. This could lead to an irrelevant alignment of the domain distributions [12], especially in diverse domain adaptation scenarios such as adversarial samples. Even if the global adversarial and clean domains have been aligned correctly, adversarial and clean samples with the same label could still be mapped far from each other in the feature space. Therefore, the main points to consider when employing domain adaptation techniques for solving the problem of adversarial attacks are
(1) Learning domain-invariant robust features by maximally matching the multimodal domain distributions
(2) Preventing the incorrect alignment of modes in the adversarial and clean distributions
(3) Minimizing the intraclass distance by aligning clean and adversarial samples from the same class as closely as possible
(4) Maximizing the interclass distance by pushing clean and adversarial samples from different classes as far apart as possible, which reduces the effect of adversarial attacks since the perturbation required to construct an adversarial example would have to be large

Accordingly, in this paper, we propose a novel multiple adversarial domain adaptation (MADA) method, which uses adversarial domain adaptation for learning robust domain-invariant features. Instead of simply considering the classification loss on adversarial and clean samples, as in AT, we consider finding an optimum alignment of the adversarial and clean domains. This helps in decreasing the sample space for adversarial examples. The overall architecture consists of three components: a feature generator, a domain critic, and a classifier. The domain critic is trained to play a min-max game with the feature generator [13] by maximizing the Wasserstein distance between the adversarial and the clean samples. On the other hand, the feature generator is trained to produce robust features by minimizing the Wasserstein distance between the adversarial and the clean samples. The feature generator also considers a classification loss to prevent any incorrect alignment of modes in the adversarial and clean distributions and considers a triplet loss to minimize the intraclass distance and maximize the interclass distance.

In short, the contributions of this paper are directed at improving the generalization of AT on both adversarial and clean samples by formulating the problem as a multiple-domain adaptation task where adversarial domains represent target domains. Specifically, we introduce a novel domain adaptation approach and employ it to minimize the distance between the clean and adversarial domains. We evaluate our MADA method using MNIST and CIFAR-10 datasets on FGSM, PGD, and BIM adversarial attacks. Experimental results show that MADA generalizes better than AT on all conducted tests.

2. Background

This section covers the necessary background material, organized into the following three subsections.

2.1. Adversarial Attacks

ML in general and DL in particular have achieved good performance in many areas such as computer vision, audio recognition, natural language processing, and many other domains [14, 15]. Nevertheless, recent studies showed that these systems are troublingly susceptible to adversarial examples [3], where a small, imperceptible perturbation added to an input sample causes the model to misclassify it. Attackers can exploit this gap in a model's behavior, making the models useless in real-life scenarios.

Formally, suppose a multiclass classification model $f: \mathbb{R}^{d} \rightarrow \{1, \dots, K\}$ trained on $n$ samples, where $n$ is the number of samples, $d$ is the dimension of the input space, and $K$ is the number of classes. We can generally define an adversarial attack on an input sample $x$ as finding a perturbed sample $x^{adv} = x + \delta$ such that $f(x^{adv}) \neq f(x)$ while $\|\delta\|_{p} \leq \epsilon$, where $\epsilon$ is the allowed perturbation, which determines the amount of change that we can add to the input. The amount of perturbation is measured using the $\ell_{p}$ norm. Different attacks use different norms, and the most popular ones are $\ell_{0}$, $\ell_{2}$, and $\ell_{\infty}$.
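As a small numerical illustration of these norms (the perturbation vector below is arbitrary), the $\ell_{0}$ norm counts changed features, the $\ell_{2}$ norm measures the Euclidean magnitude, and the $\ell_{\infty}$ norm measures the largest single change:

```python
import numpy as np

delta = np.array([0.0, 0.02, -0.05, 0.0])  # arbitrary example perturbation

l0 = np.count_nonzero(delta)        # number of changed features: 2
l2 = np.linalg.norm(delta, ord=2)   # Euclidean magnitude: ~0.054
linf = np.abs(delta).max()          # largest single change: 0.05
print(l0, l2, linf)
```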

Adversarial attacks are classified according to their knowledge of the target model into white-box, gray-box, and black-box attacks [16]. In the case of white-box attacks, the adversaries have full knowledge of the target model, including model architecture and weights, which makes crafting such attacks easier. In gray-box attacks, the adversaries have limited knowledge of the target model, such as the predicted probability of each class. In black-box attacks, the attacker can only query the model to get the final prediction, and the restriction could include the number of allowed queries. However, the existing white-box attacks are transferable to many gray-box and black-box settings [7].

Several attack methods were introduced in the literature to find the perturbation $\delta$. One of the earliest is the fast gradient sign method (FGSM) [8]. It tries to maximize the loss function by computing the gradient of the loss with respect to the input sample and updating the sample along the direction of that gradient, with a restriction on the norm of the perturbation so that the difference between adversarial and clean samples is imperceptible. Mathematically,
$$x^{adv} = x + \epsilon \cdot \operatorname{sign}\big(\nabla_{x} L(\theta, x, y)\big),$$
where $L$ is the training loss, $\theta$ denotes the model parameters, and $y$ is the true label.
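The following minimal PyTorch sketch illustrates the FGSM update; the model, loss, and $\epsilon$ value are placeholders rather than the exact configuration used in this paper.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """One-step FGSM: move each input along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    # Keep pixels in the valid [0, 1] range.
    return torch.clamp(x_adv, 0.0, 1.0).detach()
```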

Further works introduced stronger iterative attacks. One example is the basic iterative method (BIM) [17], which is similar to FGSM but runs for multiple iterations. It creates iterative perturbations as
$$x^{adv}_{t+1} = \operatorname{Clip}_{x,\epsilon}\Big\{x^{adv}_{t} + \alpha \cdot \operatorname{sign}\big(\nabla_{x} L(\theta, x^{adv}_{t}, y)\big)\Big\},$$
where $t$ denotes the iteration number, $\alpha$ is the attack step or the attack learning rate, and $\operatorname{Clip}_{x,\epsilon}$ restricts the adversarial sample to reside in the range $[x - \epsilon, x + \epsilon]$. The attack starts from the original point itself, iteratively adds perturbations in a direction that maximizes the loss, and then clips the final result to a feasible area. This area could be the valid pixel range in the image domain.

Similarly, the projected gradient descent (PGD) [17] attack works iteratively and restricts the maximal perturbation by projecting the perturbed sample into a feasible area. Different from BIM, which initializes the first point as the original sample, PGD initializes the first point randomly within the area around the original sample. The noisy initialization of PGD leads to a stronger attack that converges better.
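A hedged sketch of these iterative variants follows: with random_start=False it behaves like BIM, and with random_start=True like PGD. The step size, iteration count, and clipping range are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def iterative_attack(model, x, y, epsilon, alpha, steps, random_start=True):
    """BIM/PGD-style l_inf attack: repeated FGSM steps projected into the epsilon-ball."""
    x_adv = x.clone().detach()
    if random_start:  # PGD starts from a random point inside the epsilon-ball
        x_adv = torch.clamp(x_adv + torch.empty_like(x_adv).uniform_(-epsilon, epsilon), 0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the epsilon-ball around x and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()
```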

Carlini and Wagner (C&W) [18] proposed another popular attack that reformulates the optimization problem to minimize the distance between the adversarial and clean samples and applies a change of variable so that the adversarial sample always resides in the allowed range during the optimization process. The Jacobian-based saliency map approach (JSMA) [19] computes the Jacobian matrix of the logit layer with respect to the input and then builds an adversarial saliency map to find the best input features to change to obtain the attack. The universal adversarial patch attack [20] finds a patch perturbation (such as an eyeglass frame) in a restricted region of the input sample by optimizing the perturbation over all benign samples. Some of these adversarial attacks are summarized in Table 1.

To mitigate the effect of adversarial attacks, different adversarial defense methods have been introduced, including heuristic and certified defenses [16]. Heuristic defense methods defend against a particular attack with no guarantee of the same performance on other attacks. On the other hand, certified defense approaches provide robustness certificates, i.e., guarantees on the lowest performance under any adversarial attack within well-defined constraints.

The most reliable heuristic defense method is AT [23], which improves the model's robustness by augmenting the training dataset with adversarial examples. Randomization-based defenses [24] introduce some randomization to the input or the model architecture during inference; this random transformation is expected to eliminate the effect of the adversarial perturbation. Other works [25] apply denoising to the input or to high-level features to mitigate the effect of adversarial perturbations.

As mentioned earlier, certified defense methods tend to give a certificate on the model's accuracy under specific conditions, regardless of the attack method used to fool the model. These methods try to prove provable bounds on the model's robustness. For example, the authors of [26] considered an optimization procedure posed as a linear program that minimizes the worst case over a convex relaxation of the set of activations reachable through a norm-bounded perturbation. However, scalability is still a common issue for this kind of defense.

2.2. Adversarial Domain Adaptation (ADA)

In unsupervised domain adaptation, we are given a labelled source domain $D_{s} = \{(x_{i}^{s}, y_{i}^{s})\}_{i=1}^{n_{s}}$ of $n_{s}$ source samples and an unlabelled target domain $D_{t} = \{x_{j}^{t}\}_{j=1}^{n_{t}}$ of $n_{t}$ target samples [27]. The source domain and target domain are sampled from distributions $P_{s}$ and $P_{t}$, respectively, where $P_{s} \neq P_{t}$. ADA aims to design and implement an adversarial learning approach for generating robust features by reducing the distance between the target and source distributions with an adaptive multiclass classifier $C$, such that the expected classification risk is minimized with a cross-entropy loss as follows:
$$\mathcal{L}_{c} = -\,\mathbb{E}_{(x^{s}, y^{s}) \sim D_{s}} \sum_{k=1}^{K} \mathbb{1}\big[y^{s} = k\big] \log C_{k}\big(x^{s}\big),$$
where $C_{k}(\cdot)$ denotes the predicted probability of class $k$ and $K$ is the number of classes.

The adversarial learning approach is implemented as a two-player game between a domain discriminator $D$ and a feature extractor $G$. $D$ is trained to differentiate between the source domain and the target domain, whereas $G$ is trained to fool $D$.

$G$ extracts domain-invariant features by learning the parameters $\theta_{g}$ that maximize the loss of the domain discriminator $D$. The domain discriminator tries to distinguish samples from the two domains by learning the parameters $\theta_{d}$ that minimize its loss $\mathcal{L}_{d}$. Moreover, the parameters $\theta_{c}$ of the label predictor are learned to predict the class label of the input sample. Thus, the objective function of the domain adversarial network is
$$\mathcal{L}(\theta_{g}, \theta_{c}, \theta_{d}) = \frac{1}{n_{s}} \sum_{x_{i} \in D_{s}} \mathcal{L}_{c}\big(C(G(x_{i})), y_{i}\big) - \frac{\lambda}{n_{s} + n_{t}} \sum_{x_{i} \in D_{s} \cup D_{t}} \mathcal{L}_{d}\big(D(G(x_{i})), d_{i}\big),$$
where $d_{i}$ is the domain label of $x_{i}$ and $\lambda$ is a trade-off parameter. At the point of convergence, we get the parameters $(\hat{\theta}_{g}, \hat{\theta}_{c}, \hat{\theta}_{d})$ as the optimal solution for equation (5) as
$$(\hat{\theta}_{g}, \hat{\theta}_{c}) = \arg\min_{\theta_{g}, \theta_{c}} \mathcal{L}(\theta_{g}, \theta_{c}, \hat{\theta}_{d}), \qquad \hat{\theta}_{d} = \arg\max_{\theta_{d}} \mathcal{L}(\hat{\theta}_{g}, \hat{\theta}_{c}, \theta_{d}).$$
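One standard way to implement this min-max objective is a gradient reversal layer, which acts as the identity in the forward pass and flips (and scales) the gradient flowing from the domain discriminator back into the feature extractor. The sketch below follows the usual DANN recipe and is not this paper's MADA training code.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    # Features passed through this function train G to maximize the discriminator loss.
    return GradReverse.apply(x, lambd)
```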

2.3. Wasserstein Distance

In adversarial learning, a discriminator and a generator compete against each other, which pushes both to improve and eventually converge. Basically, the loss function of the generator evaluates how close the synthetic data distribution $P_{g}$ is to the real data distribution $P_{r}$, which is measured using the Jensen–Shannon (JS) or Kullback–Leibler (KL) divergence [28]. However, using the JS metric for simultaneous adversarial learning between the discriminator and the generator cannot guarantee convergence, especially when the two distributions are disjoint [29]. When the discriminator is perfect, we get $D(x) = 1$ for $x \sim P_{r}$ and $D(x) = 0$ for $x \sim P_{g}$, where $P_{r}$ is the real distribution and $P_{g}$ is the generated distribution. In this case, the discriminator's loss is equal to zero, and we have no useful gradients to update the parameters of the generator and the discriminator during learning.

To solve these issues, the Wasserstein metric was used instead of the JS divergence since it has a much smoother value space [30]. The Wasserstein distance, also known as the Earth mover's distance (EMD), is another metric for measuring the distance between distributions. It can be interpreted as the minimum energy required to transform one distribution into another. The formula of the Wasserstein distance is
$$W(P_{r}, P_{g}) = \inf_{\gamma \in \Pi(P_{r}, P_{g})} \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big],$$
where $\gamma$ is the transport plan, a joint distribution from $\Pi(P_{r}, P_{g})$, the set of all possible joint probability distributions between $P_{r}$ and $P_{g}$. Specifically, $\gamma(x, y)$ is the percentage of mass that should be moved from point $x$ to point $y$ so that $x$ follows the same probability distribution as $y$. Once the mass is moved, the marginal distribution over $x$ should add up to $P_{r}$, $\sum_{y} \gamma(x, y) = P_{r}(x)$, and similarly $\sum_{x} \gamma(x, y) = P_{g}(y)$. For finding the cost using EMD, $\gamma(x, y)$ is treated as the amount of mass to be moved and $\|x - y\|$ as the mass travelling distance. The infimum (greatest lower bound) indicates the minimum cost among all feasible transport plans. Then, the expected cost averaged across all the $(x, y)$ pairs is
$$\mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big] = \sum_{x, y} \gamma(x, y)\,\|x - y\|.$$
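As a small illustration of the primal definition, SciPy's one-dimensional wasserstein_distance computes exactly this minimum transport cost between two empirical distributions; the sample values are arbitrary.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two 1-D empirical distributions (arbitrary example values).
p = np.array([0.0, 1.0, 2.0])
q = np.array([5.0, 6.0, 7.0])

# Every unit of mass must travel 5 units, so the Earth mover's distance is 5.
print(wasserstein_distance(p, q))  # 5.0
```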

However, finding all the possible joint distributions in $\Pi(P_{r}, P_{g})$ is an intractable problem. Thus, the authors of [30] proposed to solve the dual problem using the Kantorovich–Rubinstein duality, which is expressed as follows:
$$W(P_{r}, P_{g}) = \frac{1}{K} \sup_{\|f\|_{L} \leq K} \; \mathbb{E}_{x \sim P_{r}}\big[f(x)\big] - \mathbb{E}_{x \sim P_{g}}\big[f(x)\big].$$

In the above equation, the supremum (least upper bound) is taken over all K-Lipschitz functions [31], and $K$ is the Lipschitz constant of the function $f$. $f$ comes from a family of K-Lipschitz continuous functions $\{f_{w}\}_{w \in \mathcal{W}}$ parameterized by $w$. Intuitively, the Lipschitz constraint makes the function smoother and prevents it from changing too quickly. Two common approaches exist for enforcing the 1-Lipschitz constraint in the above equation: gradient penalty and weight clipping. Weight clipping restricts the weight values $w$ to a certain range $[-c, c]$ controlled by the hyperparameter $c$. However, this method may suffer from a gradient vanishing problem. On the other hand, the gradient penalty works by enforcing the gradients to have a norm of at most 1 everywhere:
$$\mathcal{L}_{gp} = \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\Big[\big(\|\nabla_{\hat{x}} f_{w}(\hat{x})\|_{2} - 1\big)^{2}\Big],$$
where the points $\hat{x}$ are sampled not only from the real and generated samples but from all points between them, and all of these points should have a gradient norm of 1 for $f_{w}$.
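A common PyTorch realization of the gradient penalty interpolates between real and generated points and penalizes deviations of the critic's gradient norm from 1. This follows the standard WGAN-GP recipe, assumes the inputs are flat feature vectors, and is not necessarily the paper's exact implementation.

```python
import torch

def gradient_penalty(critic, real_feats, fake_feats):
    """WGAN-GP penalty: (||grad critic(x_hat)||_2 - 1)^2 on interpolated points."""
    batch = real_feats.size(0)
    eps = torch.rand(batch, 1, device=real_feats.device)
    x_hat = eps * real_feats.detach() + (1.0 - eps) * fake_feats.detach()
    x_hat.requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)[0]
    return ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```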

To utilize the Wasserstein distance in adversarial learning, the critic's loss function is configured to measure the Wasserstein distance between the two distributions $P_{r}$ and $P_{g}$. Thus, the discriminator's role is not to directly differentiate between fake samples and real ones. Instead, it is trained to estimate the Wasserstein distance between the two distributions by learning a K-Lipschitz continuous function, which is why it is called a critic. As the generator produces more realistic samples that are similar to the original ones, its loss function decreases during training, and the Wasserstein distance gets smaller. Therefore, training should find the optimal value for the parameters $w$ of the function $f_{w}$ from the following formula:
$$\max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim P_{r}}\big[f_{w}(x)\big] - \mathbb{E}_{\tilde{x} \sim P_{g}}\big[f_{w}(\tilde{x})\big].$$

3. Multiple Adversarial Domain Adaptation

In this section, we describe our proposed approach (MADA). We first formulate the problem as a domain adaption problem and then explain how MADA achieves the global domain alignment and class-level alignment.

3.1. Formulation

As mentioned earlier, in MADA, we formulate the defense against adversarial attacks as a domain adaptation problem. We are given one clean domain $D_{c} = \{(x_{i}^{c}, y_{i}^{c})\}_{i=1}^{n_{c}}$ of $n_{c}$ clean samples and $N$ adversarial domains $D_{a_{k}} = \{(x_{j}^{a_{k}}, y_{j}^{a_{k}})\}_{j=1}^{n_{a_{k}}}$, $k = 1, \dots, N$, of adversarial samples, one for each adversarial distribution corresponding to a considered adversarial attack, where $N$ is the number of adversarial distributions or considered adversarial attacks. The clean domain and the adversarial domains are sampled from distributions $P_{c}$ and $P_{a_{1}}, \dots, P_{a_{N}}$, respectively, where $P_{c} \neq P_{a_{k}}$. MADA aims at designing and implementing an adversarial learning approach for learning robust features and an adaptive multiclass classifier while reducing the distance between the adversarial and clean distributions, such that the expected risk on the adversarial domains is minimized.

However, in the adversarial domain adaptation problem, the class boundaries in the clean and adversarial distributions could have complicated multimodal structures. Thus, the formulation of the problem in (5) might not maximally match the distributions, or they could be incorrectly aligned. To solve this problem, some studies design multiple class-wise domain discriminators, using one separate discriminator for aligning each semantic class, which helps mitigate the incorrect alignment of domain distributions. However, allocating separate discriminators does not consider the interclass relationships and essentially forces all classes to be orthogonal to each other [32]. Employing class structural information from the label space could help in capturing the multimodal structure, especially in the adversarial attack problem, where class relationships should remain consistent across the adversarial and clean domains.

Thus, we are targeting a multiadversarial domain adaptation method for solving the problem of adversarial attacks. The extracted features should guarantee that a clean sample $x^{c}$ and the adversarial samples generated from it are as close as possible in the embedding space (intraclass distance minimization). On the other hand, clean samples from other classes and the adversarial samples generated from them should be as far as possible from $x^{c}$ (interclass distance maximization). For this purpose, MADA automatically and adaptively searches for robust generalized features shared by the clean and adversarial domains.

In other words, MADA is a new domain adaptation-inspired method that jointly aligns the clean and adversarial distributions at both the class level and the data level. To minimize the statistical distribution distance at the data level, we use the Wasserstein distance, whereas we adapt a triplet loss to align the adversarial and clean distributions at the class level. Adversarial learning is used to reduce the domain shift between the distributions by performing an adversarial game between a feature extractor $G$ on one side and a domain critic $D_{w}$ on the other side. Through this data-level and class-level alignment, discriminative and robust domain-invariant features can be learned. Therefore, the feature space shared by all domains can be automatically discovered after the feature generator successfully fools the domain critic. The full architecture of the MADA components is shown in Figure 1.
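The three components in Figure 1 can be sketched in PyTorch as follows. The layer sizes below are illustrative placeholders (they assume 28 x 28 single-channel inputs) and are not the architectures reported in Tables 2 and 4.

```python
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Convolutional feature extractor G shared by the clean and adversarial domains."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(64 * 7 * 7, feat_dim), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class DomainCritic(nn.Module):
    """Critic D_w: scores features to estimate the Wasserstein distance between domains."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, f):
        return self.net(f)

class Classifier(nn.Module):
    """Label predictor C operating on the shared features."""
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, num_classes))
    def forward(self, f):
        return self.net(f)
```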

3.2. Global Domain Alignment

Three elements are involved in globally aligning the distributions using the Wasserstein distance in this stage, namely the feature extractor $G$, the classifier $C$, and the domain critic $D_{w}$. Global domain alignment is achieved after adversarial learning between the feature extractor and the other components finishes, and domain-invariant robust features are obtained.

As was mentioned in (11), the loss function of the critic is adapted to find the best parameters $w$ for estimating the Wasserstein distance between the source and target domains. In our problem, we have the clean domain as the source domain and the family of adversarial attack domains as the target domains. By generalizing to $N$ adversarial domains, equation (11) becomes
$$\mathcal{L}_{wd} = \frac{1}{N} \sum_{k=1}^{N} \Big( \mathbb{E}_{x^{c} \sim D_{c}}\big[D_{w}\big(G(x^{c})\big)\big] - \mathbb{E}_{x^{a_{k}} \sim D_{a_{k}}}\big[D_{w}\big(G(x^{a_{k}})\big)\big] \Big).$$

To enforce the Lipschitz constraint, we use the gradient penalty as follows:
$$\mathcal{L}_{gp} = \mathbb{E}_{\hat{h} \sim P_{\hat{h}}}\Big[\big(\|\nabla_{\hat{h}} D_{w}(\hat{h})\|_{2} - 1\big)^{2}\Big],$$
where $\hat{h}$ denotes feature points sampled along straight lines between pairs of clean and adversarial feature representations.

Therefore, the objective function of the domain critic becomes
$$\max_{w} \; \mathcal{L}_{wd} - \gamma\,\mathcal{L}_{gp},$$
where $\gamma$ is a balancing parameter. The other component of adversarial learning is the feature extractor $G$, and its goal is to generate domain-invariant features by minimizing the Wasserstein distance between the clean distribution on one side and the adversarial distributions on the other side with respect to the parameters $\theta_{g}$, while keeping the parameters $w$ of the critic fixed, as follows:
$$\min_{\theta_{g}} \; \mathcal{L}_{wd}.$$

The goal of training the classifier $C$ is to find the optimal parameters $\theta_{c}$ for classifying the samples from the clean and adversarial domains. The classifier takes the features generated by $G$ as input and consists of several fully connected layers. The objective function for optimizing $C$ is
$$\mathcal{L}_{c} = \mathbb{E}_{(x, y) \sim D_{l}}\big[\mathcal{L}_{ce}\big(C(G(x)), y\big)\big],$$
where $\mathcal{L}_{ce}$ here is the cross-entropy loss and $D_{l} = D_{c} \cup D_{a_{1}} \cup \dots \cup D_{a_{N}}$ is the set of available labelled samples in the clean and adversarial domains. The objective function of the feature extractor and classifier therefore becomes
$$\min_{\theta_{g}, \theta_{c}} \; \mathcal{L}_{c} + \lambda_{wd}\,\mathcal{L}_{wd},$$
where $\lambda_{wd}$ is a balancing parameter.
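A hedged sketch of these global-alignment losses follows, reusing the gradient_penalty helper sketched in Section 2.3. The loss weights, and the assumption that each adversarial minibatch shares the labels of the clean minibatch it was generated from, are illustrative rather than the paper's exact code.

```python
import torch.nn.functional as F

def critic_objective(G, D_w, x_clean, adv_batches, gp_weight=10.0):
    """Critic update: maximize the mean Wasserstein estimate over all adversarial domains
    (written as a loss to minimize), regularized by the gradient penalty."""
    f_clean = G(x_clean).detach()  # only the critic receives gradients here
    loss = 0.0
    for x_adv in adv_batches:
        f_adv = G(x_adv).detach()
        w_est = D_w(f_clean).mean() - D_w(f_adv).mean()
        loss = loss - w_est + gp_weight * gradient_penalty(D_w, f_clean, f_adv)
    return loss / len(adv_batches)

def generator_classifier_objective(G, C, D_w, x_clean, y, adv_batches, wd_weight=1.0):
    """Feature extractor and classifier update: cross-entropy on all domains plus the
    Wasserstein estimate (only G and C should be stepped with this loss)."""
    f_clean = G(x_clean)
    loss = F.cross_entropy(C(f_clean), y)
    for x_adv in adv_batches:
        f_adv = G(x_adv)
        loss = loss + F.cross_entropy(C(f_adv), y)  # adversarial samples keep the clean labels
        loss = loss + wd_weight * (D_w(f_clean).mean() - D_w(f_adv).mean())
    return loss
```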

In the original WGAN paper [30], the authors suggest performing five critic updates for each update of the generator (critic training step $n_{c} = 5$). This number is not fixed and should be changed according to the network architecture complexity of the generator and discriminator. However, in our setting, we observe that when we increase the network complexity, the generator can easily overcome the discriminator after a relatively small number of epochs. In this case, the required critic training step becomes large (more than 15), which makes training very expensive since we need to perform a gradient penalty estimation at every step. That is why we slightly modify the training process so that we do not perform unnecessary updates of the domain critic but, at the same time, do not allow the generator to overcome the discriminator, by adaptively changing the critic training step $n_{c}$. We check the Wasserstein distance estimates of the discriminator and the generator and keep updating the discriminator until its estimate is larger than the one recorded at the last generator update. Our final algorithm is described in Algorithm 1.
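A minimal sketch of this adaptive schedule, assuming hypothetical helpers wasserstein_estimate (the critic's current distance estimate on a batch), critic_step, and generator_step (one optimizer update each); these names are illustrative and not from the paper.

```python
def adaptive_training_step(batch_clean, adv_batches, critic_step, generator_step,
                           wasserstein_estimate, max_critic_updates=25):
    """Update the critic only until its distance estimate overtakes the generator's."""
    w_generator = wasserstein_estimate(batch_clean, adv_batches)
    for _ in range(max_critic_updates):
        critic_step(batch_clean, adv_batches)            # raises the Wasserstein estimate
        if wasserstein_estimate(batch_clean, adv_batches) > w_generator:
            break                                        # critic is ahead again; stop early
    generator_step(batch_clean, adv_batches)             # lowers the distance + task losses
```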

3.3. Class-Level Alignment

Reducing the global distribution discrepancy without considering the class-level association between the source and target samples could lead to semantic misalignment. To solve this problem, we add a class similarity-preserving constraint to our objective function. As a result, samples with the same labels are pulled closer together in the feature space, and samples with different labels are pushed away from each other. This class-level alignment can be implemented by minimizing a triplet loss so that the clean and adversarial feature embeddings maintain intraclass closeness and interclass separability [33]. The triplet loss operates on three samples as input: an anchor, which is any arbitrary sample; a positive sample, which has the same class as the anchor; and a negative sample, which has a different class from the anchor. The triplet loss minimizes the distance in the feature space between the anchor and the positive sample while maximizing the distance between the anchor and the negative sample. Mathematically, the triplet loss function is defined as follows:
$$\mathcal{L}_{trip}(a, p, n) = \max\big(0,\; \|G(a) - G(p)\|_{2}^{2} - \|G(a) - G(n)\|_{2}^{2} + m\big),$$
where $m$ is the margin by which the distance between the anchor and the negative sample must be larger than the distance between the anchor and the positive sample.
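The margin-based formulation above maps directly to a few lines of PyTorch; the margin value below is an arbitrary placeholder.

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin), averaged over the batch of embeddings."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```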

In our setting, the anchor $a$ is a clean sample classified by a classifier $C_{clean}$ with its true label $y$; the positive sample $p$ is an adversarially perturbed sample that $C_{clean}$ classifies with a label other than $y$ but that should be classified as $y$; and the negative sample $n$ is an adversarially perturbed sample that $C_{clean}$ classifies as $y$ but that should be classified with a different label. $C_{clean}$ is a classifier trained on the clean domain only, so it considers accuracy as the only measure of performance and does not consider the robustness of the model. Hence, the set of triplets in our setting is
$$\mathcal{T} = \big\{(a, p, n) \;:\; y_{a} = y_{p},\; y_{n} \neq y_{a}\big\},$$
where $y_{a}$, $y_{p}$, and $y_{n}$ denote the true labels of the anchor, positive, and negative samples, respectively.

For each of the target adversarial domains, we compose triplet training samples by attacking the clean source dataset with the adversarial attack of that target domain. We use the batch-hard strategy for choosing proper positive and negative samples: for each clean anchor sample $a$ in the batch, the hardest positive sample is the farthest positive sample, $p^{*} = \arg\max_{p : y_{p} = y_{a}} \|G(a) - G(p)\|_{2}$, and the hardest negative sample is the closest negative sample, $n^{*} = \arg\min_{n : y_{n} \neq y_{a}} \|G(a) - G(n)\|_{2}$. Our final objective function becomes
$$\min_{\theta_{g}, \theta_{c}} \; \mathcal{L}_{c} + \lambda_{wd}\,\mathcal{L}_{wd} + \lambda_{t}\,\mathcal{L}_{trip},$$
where $\lambda_{wd}$ and $\lambda_{t}$ are balancing parameters.
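The batch-hard selection can be sketched as follows; this is the standard recipe over a batch of embeddings and labels, not necessarily the paper's exact implementation.

```python
import torch

def batch_hard_triplets(embeddings, labels):
    """For each anchor, return the farthest same-label and closest different-label distances."""
    dist = torch.cdist(embeddings, embeddings)            # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-class mask
    hardest_pos = (dist * same.float()).max(dim=1).values                  # farthest positive
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values   # closest negative
    return hardest_pos, hardest_neg
```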

Require: a clean dataset $D_{c}$, minibatch size $m$, learning rates $\eta$, critic training step $n_{c}$, and $N$ adversarial domains
(1) Initialize the feature generator $G$, domain critic $D_{w}$, and classifier $C$ with weights $\theta_{g}$, $w$, and $\theta_{c}$
(2) repeat
(3) Sample a clean minibatch $B_{c}$ from $D_{c}$
(4) for $k = 1, \dots, N$ do
(5) Use the current state of $G$ and $C$ to generate adversarial examples from $B_{c}$ with attack $k$ and create the adversarial minibatch $B_{a_{k}}$
(6) end for
(7) while the critic's Wasserstein estimate does not exceed the estimate recorded at the last generator update do
(8) Compute $\mathcal{L}_{wd}$ and $\mathcal{L}_{gp}$ on $B_{c}$ and $B_{a_{1}}, \dots, B_{a_{N}}$
(9) Update $w$ by ascending the gradient of $\mathcal{L}_{wd} - \gamma\,\mathcal{L}_{gp}$
(10) end while
(11) Compute $\mathcal{L}_{c}$, $\mathcal{L}_{wd}$, and $\mathcal{L}_{trip}$ on $B_{c}$ and $B_{a_{1}}, \dots, B_{a_{N}}$
(12) Update $\theta_{g}$ and $\theta_{c}$ by descending the gradient of $\mathcal{L}_{c} + \lambda_{wd}\,\mathcal{L}_{wd} + \lambda_{t}\,\mathcal{L}_{trip}$
(13) until convergence

Algorithm 1: The MADA training procedure.

4. Experiments

In principle, our method can be applied to any dataset and any adversarial attack. For comparison against adversarial training, we focus on two datasets, MNIST [34] and CIFAR-10 [35], and three adversarial attacks: FGSM, BIM, and PGD. For all experiments, we normalize the pixel values to the range [0, 1].

4.1. Experiment Setup

In every training iteration, we use FGSM, PGD, and BIM to generate three adversarial targets on the fly. To evaluate the effectiveness of our method, we compare our MADA method with
(1) Normal training (NT) with the cross-entropy loss [36] on the clean training data
(2) Adversarial training (AT) with the cross-entropy loss on the clean training data and the adversarial examples from FGSM, PGD, and BIM
(3) MADA without the triplet loss, where we remove the triplet loss to measure the effect of class-level alignment
(4) MADA without the triplet loss and the classification loss, where we keep only the global domain alignment loss

For each dataset, we train a vanilla model (NT), a MADA model, and the three adversarially trained models explained above with the same perturbation budget $\epsilon$ for comparison, and we evaluate these models on FGSM, PGD, and BIM attacks bounded by the same $\epsilon$. We use the $\ell_{\infty}$ norm as the measure of perturbation in all attacks. The experiments were run on a single GeForce GTX 1080 Ti.

The network architecture and training parameters were chosen to work reasonably well rather than tuned for optimal performance; in principle, any conventional image classification model can be used. The feature generator consists of a stack of convolutional layers, while the critic and classifier consist of stacks of fully connected layers. While any optimization method can be used for training, we choose Adam [37] for training all the components with a batch size of 64 and 200 epochs. The learning rate starts at 0.005 and is divided by 2 every 30 epochs. After training, the domain critic can be removed, and the robust feature generator together with the classifier can be used in place of a conventional image classifier.
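A hedged sketch of this training configuration, assuming G and C are instances of the components sketched in Section 3.1 (the epoch loop body is omitted):

```python
import torch

params = list(G.parameters()) + list(C.parameters())
optimizer = torch.optim.Adam(params, lr=0.005)
# Halve the learning rate every 30 epochs, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

for epoch in range(200):
    # ... one pass over the minibatches following Algorithm 1 ...
    scheduler.step()
```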

4.2. Experimental Results
4.2.1. Results on MNIST

Since MNIST is relatively easy to classify, we use simple network architectures for the different components, shown in Table 2. The allowed adversarial perturbation $\epsilon$ in this case is 0.3, and the maximum number of iterations for BIM and PGD is 30. The accuracy results are reported in Table 3. NT has the best accuracy on clean data but the worst robustness, i.e., accuracy on adversarial samples. The accuracy on clean samples is almost the same for AT and MADA, but MADA increases robustness on adversarial samples significantly. We also notice the importance of the classification and triplet losses: the accuracy of MADA decreases significantly on both clean and adversarial samples when we remove them. The results in Table 3 show that MADA efficiently exploits the three proposed losses to find the best alignment between the adversarial and clean domains without sacrificing accuracy on clean samples.

4.2.2. Results on CIFAR10

Here, we use a VGG architecture since classifying CIFAR-10 is harder than MNIST. The convolutional layers compose the feature extractor, while the classifier and the domain critic consist of the final fully connected layers and are unchanged from the MNIST setup, as shown in Table 4.

Here, the attacks are bounded by the maximum allowed perturbation $\epsilon$, and the number of iterations for BIM and PGD is 30. We can observe from the results in Table 5 that MADA leads to a very small drop in clean accuracy while increasing robustness to adversarial samples compared to AT. These results on CIFAR-10 are consistent with the observations on MNIST and show that MADA surpasses AT on both clean and adversarial samples.

4.2.3. Feature Visualization

We further investigate the differences in the distributions of the extracted features between the NT, AT, and MADA models on the MNIST dataset. We use t-SNE (t-distributed stochastic neighbor embedding) to plot the embedded features in a two-dimensional space. Figure 2 shows feature visualizations of the testing data and of adversarial data from FGSM and PGD for the NT, AT, and MADA models, with colors corresponding to different classes. The figures show that our method forces the model to map data from the same class as close as possible to each other and as far as possible from samples of different classes. We also notice that the constructed adversarial samples in MADA are farther from the center of their class compared to NT and AT, which means that the adversarial methods need to add stronger perturbations in order to fool the model.
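A hedged sketch of the visualization step with scikit-learn; the feature and label files are hypothetical placeholders for embeddings exported from the trained feature generator.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.load("features.npy")  # (n_samples, feat_dim) embeddings (assumed exported)
labels = np.load("labels.npy")      # corresponding class ids

embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=5)
plt.savefig("tsne_features.png")
```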

5. Conclusion

In this paper, we design a domain adaptation-based approach to boost the performance of adversarial training on adversarial samples. The proposed approach reduces the effect of adversarial attacks by aligning the adversarial domain distributions with the clean distribution in the feature embedding space. The experimental results show that our approach increases the generalization of the model on the adversarial domains and gives a more interpretable structure of the features in the embedding space. Our approach could be further developed by studying distribution-alignment measures other than the Wasserstein distance, which we leave for future research.

Data Availability

Previously reported datasets were used to support this study and are available at DOI: 10.1109/MSP.2012.2211477 and DOI: 10.1.1.222.9220. These prior studies and datasets are cited at relevant places within the text as references [28, 29].

Conflicts of Interest

The authors declare that they have no conflicts of interest.