Abstract
The existence of adversarial examples and the ease with which they can be generated raise several security concerns with regard to deep learning systems, pushing researchers to develop suitable defence mechanisms. The use of networks adopting error-correcting output codes (ECOC) has recently been proposed to counter the creation of adversarial examples in a white-box setting. In this paper, we carry out an in-depth investigation of the adversarial robustness achieved by the ECOC approach. We do so by proposing a new adversarial attack specifically designed for multi-label classification architectures, like the ECOC-based one, and by applying two existing attacks. In contrast to previous findings, our analysis reveals that ECOC-based networks can be attacked quite easily by introducing a small adversarial perturbation. Moreover, the adversarial examples can be generated in such a way as to achieve high probabilities for the predicted target class, hence making it difficult to use the prediction confidence to detect them. Our findings are supported by experimental results obtained on the MNIST, CIFAR-10, and GTSRB classification tasks.
1. Introduction
Deep neural networks can solve complicated computer vision tasks with unprecedentedly high accuracy. However, they have been shown to be vulnerable to adversarial examples, namely, properly crafted inputs that introduce small (often imperceptible) perturbations and induce a classification error [1–3]. The possibility of crafting both non-targeted and targeted attacks has been demonstrated, the goal of the former being to induce any kind of classification error [4, 5], while the latter aim at making the network decide for a target class chosen a priori [1, 6]. In general, targeted attacks are more difficult to build.
As a reaction to the threats posed by adversarial examples, many defence mechanisms have been proposed to increase the adversarial robustness of deep neural networks [7–12]. However, in a white-box setting, wherein the attacker has full knowledge of the attacked network, including the defence mechanism, more powerful attacks can be developed, thus tipping the scale again in favour of the attacker [4, 13].
In this arms race, a novel defence strategy based on error-correcting output coding (ECOC) [14] has recently been proposed in [15] to counter adversarial attacks in a white-box setting. More specifically, given a general multi-class classification problem, error-correcting output codes are used to encode the various classes and represent the network’s outputs. To explain how, let us refer to the output of the last layer of the network, prior to the final activation layer, as logit values or simply logits. In general, the final activation layer consists of an activation function, which maps the logits into a prescribed range, and a normalization layer, which maps the output of the activation function into a probability vector, associating a probability value to each class. In the common case of one-hot encoding, a softmax layer is used, in which case these two steps are performed simultaneously. During training, the network learns to output a large logit value for the true class and small values for all the others. With the ECOC approach, instead, the network is trained in such a way as to produce normalized logit values that correlate well with the codeword used to encode the class the input sample belongs to. In general, ECOC codewords have many nonzero values, thus marking a significant difference with respect to the one-hot encoding case.
The rationale behind the use of the ECOC architecture to counter the construction of adversarial examples [15] is the following. With classifiers based on standard one-hot encoding, the attacker can induce an error by modifying one single logit (reducing the one associated with the ground-truth class or increasing the one associated with the target class). The final decision of the ECOC classifier, instead, depends on multiple logits in a complicated manner, and hence it is supposedly more difficult to attack (especially when longer codewords are used).
In [15], the authors considered non-targeted attacks in their experiments and showed with the popular white-box C&W attack that the attack success rate on CIFAR-10 [16] drops from 100%, for one-hot encoding, to 29%, for an ECOC-based classifier.
Another alleged advantage of the ECOC architecture proposed in [15] is linked to the way the probabilities associated with each class are computed. Rather than using a softmax function, as commonly done with one-hot encoding, first the correlation between the activated outputs and the codeword is computed, and then a linear normalization procedure is applied (see equation (2) in the following). In this way, the probability assigned to the class chosen by the classifier grows more smoothly, and samples close to the decision region boundary (as adversarial examples are likely to be) are classified with a low confidence. Results presented in [15], in fact, show that the ECOC model tends to provide sharp results for clean images, while it is often uncertain about the (incorrect) prediction made on adversarial examples. This behavior could be exploited to, at least, distinguish between adversarial examples and benign inputs.
The goal of this paper is to further verify if, and to what extent, the use of error correction codes to encode the output of deep neural networks increases the robustness against targeted adversarial examples. We do so by introducing a new white-box attack, inspired by the C&W attack, explicitly designed to work not only against ECOC but also against other multi-label classifiers. In fact, the original C&W attack is naturally designed to deceive networks adopting the one-hot encoding strategy, and it loses some of its advantages when used against ECOC systems. We stress that, in contrast to previous works (see, for instance, [15] and Section 10 in [17]), we aim at developing a targeted attack, which is a more difficult task than crafting non-targeted adversarial examples. This is a reasonable choice for at least two reasons. First, targeted attacks are more flexible than non-targeted ones since they can be used in a wider variety of applications, wherein the ultimate goal of the attack may vary considerably. Secondly, being able to attack a defence under the most stringent attacking constraints better illustrates the weakness of the defence itself.
We ran extensive experiments to evaluate the ability of ECOC-based classifiers to resist the new attack and compared the results with those obtained by applying a fine-tuned version of the C&W attack and the LOTS attack introduced in [18]. The experiments were carried out by considering three different classification tasks, namely, traffic sign classification (GTSRB) [19], CIFAR-10 classification [16], and MNIST [20]. We found that the ECOC classifiers can be attacked with a high success rate. In particular, the new attack outperforms the other two, especially when long codewords are used by ECOC. We also verified that, by increasing the confidence of the attack, adversarial examples can achieve high probabilities for the predicted target class, similar to those of benign samples, hence making it difficult to use the prediction confidence to detect adversarial samples. Overall, our analysis reveals that the security gain achieved by the ECOC scheme is a minor one, thus calling for more powerful defences.
The rest of this paper is organised as follows: we first briefly review the ECOC scheme presented in [15], and then we describe the proposed attack. The setup considered for the experiments and the results we obtained are reported and discussed in Section 4. Finally, we review the related work at the end of the paper.
2. ECOC-Based Classification
Let us first introduce the notation for a general multi-class CNN. Let $x$ be the input of the network and $y$ the class label, $y \in \{1, \dots, M\}$, where $M$ denotes the number of classes. Let $f(x)$ indicate the decision function of the network. We denote by $z$ the vector of logit values, that is, the network values before the final activations and the mapping to class probabilities. For one-hot encoding schemes, $z$ has length $M$, and the logits are directly mapped into probability values through the softmax function as follows:

$$p_s(i) = \frac{\exp(z_i)}{\sum_{j=1}^{M} \exp(z_j)} \tag{1}$$

for $i = 1, \dots, M$. Then, the final prediction is made by letting $f(x) = \arg\max_i p_s(i)$.
The error-correcting output coding (ECOC) scheme proposed in [15] assigns a codeword $C_k$ of length $N$ to every output class $k = 1, \dots, M$. $C$ denotes the $M \times N$ codeword matrix whose $k$-th row is $C_k$. Each element of $C$ can take values in $\{-1, 1\}$. In this way, the length of the logit vector $z$ is $N$. The logits are first mapped into the range $[-1, 1]$ by means of an activation function $\sigma(\cdot)$ (e.g., the tanh function). Then, the probability of class $k$ is computed by looking at the correlation with $C_k$, according to the following formula:

$$p_\sigma(k) = \frac{\max(\sigma(z) \cdot C_k, 0)}{\sum_{i=1}^{M} \max(\sigma(z) \cdot C_i, 0)} \tag{2}$$

where $\cdot$ denotes the inner product and $\sigma$ is a sigmoid activation function applied element-wise to the logits. Since the $C_k$’s take values in $\{-1, 1\}$, the max is necessary to avoid negative probabilities. According to [15], the common softmax rule (equation (1)) is able to express uncertainty between two classes only when the corresponding logits are roughly equal (i.e., $z_i \approx z_j$, so that the two probabilities are close). In a two-dimensional case, this corresponds to a very narrow stripe, approximately a line, across the boundary of the decision region, while in high-dimensional spaces, the region approximates a hyperplane, an $(M-1)$-dimensional subspace of $\mathbb{R}^M$ with negligible volume, and hence the classifier outputs high probabilities almost everywhere. This makes it very easy for the attacker to find an adversarial input that is predicted (wrongly) with high confidence. With ECOC (equation (2)), instead, it is sufficient that two correlations are approximately equal, $\sigma(z) \cdot C_i \approx \sigma(z) \cdot C_j$, for the classifier to express uncertainty. A non-trivial volume is then allocated to low-confidence regions in the logit space, thus limiting the freedom of the attacker to craft high-confidence adversarial examples. An overall sketch of the ECOC scheme is depicted in Figure 1. The logits are first mapped into correlation values $\sigma(z) \cdot C_k$ (mapping step 1), and then the vector of correlations is normalized so as to form a probability distribution (mapping step 2) via the normalization function in (2). The model’s final predicted label is $\arg\max_k p_\sigma(k)$.
Equation (2) is a generalization of the standard softmax activation in equation (1) and reduces to it for the case of one-hot encoding, that is, when $C = I_M$ and $\sigma(z) = \exp(z)$, where $I_M$ is the $M \times M$ identity matrix.
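The two probability mappings discussed in this section can be sketched in NumPy as follows. This is an illustrative sketch, not the implementation used in [15]: the helper names (`hadamard`, `softmax`, `ecoc_probs`), the Sylvester construction of the Hadamard matrix, the use of its first 10 rows as codewords, and the uniform fallback when every correlation is clipped to zero are all assumptions made here.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def softmax(z):
    """One-hot case: logits are mapped directly to probabilities."""
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

def ecoc_probs(z, C, activation=np.tanh):
    """ECOC case: correlate activated logits with each codeword, clip, normalize."""
    corr = C @ activation(z)           # mapping step 1: one correlation per class
    corr = np.maximum(corr, 0.0)       # the max(., 0) avoids negative probabilities
    total = corr.sum()
    if total == 0.0:                   # assumed fallback, not specified in the text
        return np.full(C.shape[0], 1.0 / C.shape[0])
    return corr / total                # mapping step 2: linear normalization

# Hypothetical setup: M = 10 classes encoded with rows of a 16 x 16 Hadamard matrix.
C = hadamard(16)[:10]
z = 5.0 * C[3]                         # logits strongly aligned with codeword 3
p = ecoc_probs(z, C)
print(int(np.argmax(p)), float(p[3]))
```

With the identity matrix as codebook and `np.exp` as activation, `ecoc_probs` returns the same vector as `softmax`, which is the reduction to the one-hot case mentioned above.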
The purpose of the ECOC architecture is to design a classifier which is robust to changes of multiple logits and hence, expectedly, more difficult to attack (with standard one-hot encoding the adversary can succeed by altering a single logit). For the scheme to be effective, codewords characterised by a large minimum Hamming distance must be chosen. For simplicity, in [15], the ECOC classifier is built by using Hadamard codes taking values in $\{-1, 1\}$ (when $C$ is a Hadamard matrix, the Hamming distance for large $N$ approaches the limit value $N/2$). An advantage of this choice is that, since $C$ is orthogonal, whenever the network outputs a codeword exactly (that is, when $\sigma(z) = C_k$), then $p_\sigma(k) = 1$. The tanh function is selected as the activation function $\sigma$.
The authors also found that, rather than considering a single network with $N$ outputs, a classifier consisting of an ensemble of several smaller networks, each one outputting a few codeword elements, achieves a larger robustness against attacks. By training separate networks, in fact, the correlation between errors affecting different bits of the codewords is reduced, thus forcing the attacker to attack all the bits independently. In the scheme in Figure 1, every network outputs one codeword bit only, resulting in $N$ ensemble branches.
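The role of the minimum Hamming distance can be checked numerically. The sketch below is illustrative only: the helper names, the Sylvester construction, the 16-bit code, and the hard-decision correlation decoder are assumptions made here, not details taken from [15]. It measures the minimum distance of a Hadamard codebook and verifies that flipping fewer than half that many bits still decodes to the original class.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def min_hamming_distance(C):
    """Smallest number of differing positions between any two distinct codewords."""
    M = C.shape[0]
    return int(min((C[i] != C[j]).sum() for i in range(M) for j in range(i + 1, M)))

def decode(bits, C):
    """Hard-decision decoding: pick the codeword with the largest correlation."""
    return int(np.argmax(C @ bits))

N = 16
C = hadamard(N)[:10]                   # hypothetical codebook for 10 classes
d_min = min_hamming_distance(C)

# Flip floor((d_min - 1) / 2) randomly chosen bits of codeword 4.
rng = np.random.default_rng(0)
received = C[4].copy()
flipped = rng.choice(N, size=(d_min - 1) // 2, replace=False)
received[flipped] *= -1

print(d_min, decode(received, C))
```

Because distinct Hadamard rows are orthogonal, the corrupted word remains more correlated with its original codeword than with any other, which is the property an attacker has to overcome by changing several output bits.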
3. Attacking ECOC
We start by considering the basic C&W attack introduced in [6]. We notice that some of the good properties of C&W no longer hold when the attack is applied against the ECOC scheme, since it was originally designed to work against networks adopting the one-hot encoding strategy. Then, we propose a new, more effective attack, which is specifically tailored to multi-label structures like ECOC.
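In general, constructing an adversarial example corresponds to finding a small perturbation $\delta$ (under some distance metric) that, once added to the image $x$, will change its classification. Such a problem is usually formalised as

$$\min_{\delta} D(x, x + \delta) \quad \text{subject to} \quad f(x + \delta) = t, \tag{3}$$

where $D$ is some distance metric (e.g., the $L_2$ metric) and $t$ is a chosen target class. As this problem is difficult to solve, the C&W attack aims at solving its Lagrangian approximation, defined as

$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot \max\Big(\max_{i \neq t} z_i - z_t,\; -\kappa\Big), \tag{4}$$

where $z$ denotes the logits of the perturbed input $x + \delta$, and the second term is any function such that $f(x + \delta) = t$ if and only if this term is $\le 0$. $\|\cdot\|_2$ denotes the $L_2$ norm, and $c$ and $\kappa$ are constant parameters ruling, respectively, the trade-off between the two terms of the optimization problem and the confidence margin of the attack. (In [6], $\delta = \frac{1}{2}(\tanh(w) + 1) - x$, and the minimization is carried out over $w$ to have box constraints on $\delta$ when optimizing equation (4) with a common optimizer like Adam.) Equation (4) is designed for the common one-hot encoding case. For ECOC, the motivation of such a design no longer holds: ensuring that the second term is $\le 0$ does not guarantee that $f(x + \delta) = t$. Therefore, the C&W attack must be adjusted to fit the ECOC framework. By noting that, in ECOC, the probabilities are proportional to the correlations (instead of being determined by the logits directly, as with one-hot encoding), C&W can be modified as

$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot \max\Big(\max_{i \neq t} \sigma(z) \cdot C_i - \sigma(z) \cdot C_t,\; -\kappa\Big), \tag{5}$$

where $\sigma(z) \cdot C_i$ is the correlation between the activated logits of $x + \delta$ and the codeword of class $i$.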
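To make the difference between the two settings concrete, the following sketch evaluates a C&W-style margin term on raw logits and on ECOC correlations. It is a sketch under stated assumptions: the function names, the 4-bit Hadamard codebook, and the example logits are hypothetical, and the margin is written in the usual hinge form of [6] rather than copied from the equations of this paper.

```python
import numpy as np

def cw_margin(scores, target, kappa=0.0):
    """Hinge term max(max_{i != t} s_i - s_t, -kappa) on a vector of class scores."""
    others = np.delete(scores, target)
    return max(float(others.max() - scores[target]), -kappa)

def cw_objective(delta, scores, target, c, kappa=0.0):
    """Squared L2 norm of the perturbation plus the weighted margin term."""
    return float(np.sum(delta ** 2) + c * cw_margin(scores, target, kappa))

# Hypothetical 4-class ECOC codebook (rows of a 4 x 4 Hadamard matrix).
C = np.array([[1,  1,  1,  1],
              [1, -1,  1, -1],
              [1,  1, -1, -1],
              [1, -1, -1,  1]])

z = np.array([0.1, 3.0, -3.0, -3.0])   # example logits of a perturbed input
target = 1

logit_margin = cw_margin(z, target)            # margin computed on raw logits
correlations = C @ np.tanh(z)                  # class scores actually used by ECOC
corr_margin = cw_margin(correlations, target)  # margin computed on correlations
decoded = int(np.argmax(correlations))

print(logit_margin, corr_margin, decoded)
```

In this example the logit-based margin is already non-positive, yet the ECOC decoder assigns the input to a different class and the correlation-based margin is still positive. This is why the margin has to be computed on the correlations when the attack is applied to ECOC.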
A key advantage of the C&W attack against one-hot encoding networks is that it works directly at the logits level. In fact, logits are more sensitive to modifications of the input than the probability distribution obtained after the softmax activation (most adversarial attacks work directly on the probability values obtained after the softmax, which makes them less effective than C&W and prone to gradient-vanishing problems).
When the C&W attack is applied against ECOC (by means of (5)), it does not work at the output logit level but after the correlations are computed (mapping step 1), since this is the layer that precedes the application of the softmax-like function. The correlations between the activations of the logits and the codewords will likely have a reduced sensitivity to input modifications, and this may decrease the effectiveness of the attack. We also found that, during the attack, it is possible to change only one bit of the output while the others are almost unchanged. This can be explained by observing that ECOC trains each output bit separately, so that each bit can be treated as an individual label. In this way, the correlation between the output bits is significantly reduced compared to classifiers adopting the one-hot encoding approach. We exploit this fact to design our attack in such a way as to make it modify a single bit at a time and iteratively repeat this process to eventually change multiple bits.
With the above ideas in mind, the new attack is formulated as the optimization problem in equation (6), where $C_t$ is the desired target codeword, $c$ is a parameter controlling the trade-off between the two terms of the objective function, and $\kappa$ is a constant parameter used to set a confidence threshold for the attack. Specifically, the attack seeks to minimize (6) until the product between the logits $z$ and $C_t$ reaches this threshold; thus, a higher $\kappa$ will result in adversarial examples exhibiting a higher correlation with the target codeword, that is, adversarial examples that are (wrongly) classified with a higher confidence.
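The idea of pushing one output bit at a time until every bit agrees with the target codeword can be illustrated with a toy example that operates directly in logit space. The loss below, a hinge on the least-aligned bit, is an assumption chosen to match the verbal description; it is not claimed to be the paper's equation (6), and the function names, step size, and example codewords are hypothetical. A real attack would optimize the input perturbation through the network and include the distortion term.

```python
import numpy as np

def bitwise_margin(z, target_codeword, kappa):
    """Assumed hinge: negated agreement of the worst bit, floored at -kappa."""
    agreement = z * target_codeword            # positive when a bit has the target sign
    return max(float(np.max(-agreement)), -kappa)

def push_worst_bit(z0, target_codeword, kappa, lr=0.1, max_steps=1000):
    """Subgradient descent on bitwise_margin: only the worst bit moves at each step."""
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(max_steps):
        agreement = z * target_codeword
        worst = int(np.argmin(agreement))
        if agreement[worst] >= kappa:          # every bit has reached the threshold
            break
        z[worst] += lr * target_codeword[worst]
    return z

source = np.array([1, -1, 1, -1])              # codeword of the original class
target = np.array([1, 1, -1, -1])              # codeword of the target class
z_adv = push_worst_bit(2.0 * source, target, kappa=1.0)

print(np.sign(z_adv), bitwise_margin(z_adv, target, kappa=1.0))
```

Only the two bits on which the source and target codewords disagree are modified, and a larger `kappa` forces a larger agreement with the target codeword before the loop stops.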
The choice of $c$ also plays an important role in the attack, given that a very small $c$ would lead to a vanishing perturbation. On the contrary, using a larger $c$ results in a more effective attack at the cost of a larger perturbation. To optimize the value of $c$, we use a binary search similar to the one used in [6] to determine the optimum value of $c$ in the C&W attack. By doing so, the parameters of the proposed attack have the same meaning as those in the C&W attack; thus, the two methods can be compared on a fair basis under the same parameter setting. An overall description of our attack is given in Algorithm 1, whose goal is to find a valid adversarial example with the desired confidence level and the smallest perturbation. As a result of the optimization in Algorithm 1, all logit values of the resulting adversarial image will tend to be highly correlated with $C_t$.
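The binary search over the trade-off parameter can be sketched as a small driver around an inner attack routine. This is a sketch in the spirit of the search used in [6], not a transcription of Algorithm 1: the function `binary_search_c`, its defaults, and the stand-in `toy_attack` (which simply succeeds above a hidden threshold) are hypothetical.

```python
def binary_search_c(run_attack, c_init=1.0, steps=10):
    """Search for the smallest trade-off constant for which the attack succeeds.

    run_attack(c) must return (success, distortion, adversarial_example).
    """
    lower, upper = 0.0, None
    c, best = c_init, None
    for _ in range(steps):
        success, distortion, adv = run_attack(c)
        if success:
            # Keep the least distorted successful example, then try a smaller c.
            if best is None or distortion < best[0]:
                best = (distortion, c, adv)
            upper = c
            c = (lower + upper) / 2.0
        else:
            # Failure: grow c, by x10 until some success bounds it from above.
            lower = c
            c = c * 10.0 if upper is None else (lower + upper) / 2.0
    return best

def toy_attack(c):
    """Stand-in for the inner optimization: succeeds once c exceeds 3.7."""
    return c >= 3.7, c, None

best_distortion, best_c, _ = binary_search_c(toy_attack)
print(best_c)
```

Each search step requires a full run of the inner optimization, which is why the number of steps has a direct impact on the running time of this kind of attack.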

It is worth observing that, even if we designed the new attack explicitly targeting the ECOC classifier, the algorithm in (6) is generally applicable to any multi-label classification network since it manipulates the output bits of the network, regardless of the adopted coding strategy. This point can be illustrated by considering two limit cases of ECOC. In the first case, we avoid using error correction to encode the output classes. This is equivalent to multi-label classification problems with $N$ labels [21, 22], and the proposed attack can still be applied. In the second case, we may consider one-hot encoding as a particular way of encoding the output class. This perspective, which has also been considered in [15], would reduce the ECOC system to a common network that uses one-hot encoding and softmax to solve a multi-class classification task. Since our attack does not involve the decoding part of the network, it can still be applied to such networks.
4. Experiments
4.1. Methodology
In [15], the authors tested the robustness of the ECOC architecture for various combinations of codeword matrices $C$, activation functions $\sigma$, and network structures. In particular, they considered the MNIST [20] and CIFAR-10 [16] classification tasks ($M = 10$ in both cases). In the end, the best performing system was obtained by considering a Hadamard code with $N = 16$ and the tanh activation function. An ensemble of 4 networks, each one outputting 4 bits, was considered. The authors argue that using a large number of ensemble branches increases the performance of the system against attacks (by decreasing the dependency among output bits). Therefore, in our experiments, we used $N$ ensemble branches, with only one output bit each. The authors also indicate that the robustness of the ECOC scheme can be improved by using longer codewords. For this reason, in addition to MNIST and CIFAR-10, already considered in [15], we also considered traffic sign classification (GTSRB dataset) [19], to test the robustness of ECOC on a larger number of classes and with longer codewords, which potentially means higher robustness. To be specific, for traffic sign classification, we restricted the task to the classes with the most examples among the 44 classes in GTSRB and chose a Hadamard code with $N = 32$, which is twice the size of the code used for MNIST and CIFAR-10. A diagram of the ECOC scheme with the ensemble structure is shown in Figure 1. We used a standard VGG16 network [23] as the base block of our implementation. Following the ECOC design scheme, the first 6 layers form the so-called shared bottom part, that is, the layers shared by all the networks of the ensemble. Then, the remaining 10 layers (the last 8 convolutional layers and the 2 fully connected layers) are trained separately for each ensemble branch.
For each task, we first trained one $M$-class classification network, and then we fine-tuned the weights to get the ensemble networks. The error rates of the trained models on clean images are equal to 2.14% for MNIST, 13.9% for CIFAR-10, and 1.28% for traffic sign (GTSRB) classification.
In addition to the extended C&W attack described in Section 3, we also considered a recent attack named layer-wise origin-target synthesis (LOTS), introduced in [18]. In a few words, LOTS aims at modifying the deep representation at a chosen attack layer by minimizing the Euclidean distance between the deep features of the to-be-attacked input and a target deep representation chosen by the attacker. In our tests, we applied LOTS at the logits level, and we obtained the target deep representation (logits) by randomly choosing 50 images belonging to the target class.
4.2. Results
We attacked 300 images randomly chosen from the test set of each task. For each image, we carried out a targeted attack with the target class $t$ chosen at random among the remaining classes (i.e., all the classes except the original class of the unperturbed image). The label $t$ of the target class was used to run the C&W attack in equation (5) and LOTS, while the codeword $C_t$ associated with $t$ is considered in (6) for the new attack. We use the attack success rate (ASR) to measure the effectiveness of the attack, i.e., the percentage of generated adversarial examples that are assigned to the target class, and the peak signal-to-noise ratio (PSNR) to measure the distortion introduced by the attack. The PSNR is computed from the $L_2$ norm of the perturbation, $\|\delta\|_2$, and the size of the image, $n$.
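The two evaluation metrics can be computed as in the sketch below. The PSNR follows the standard definition based on the mean squared error, which equals the squared $L_2$ norm of the perturbation divided by the image size; the peak value of 255 is an assumption (it would be 1 for images scaled to [0, 1]), and the function names and example arrays are hypothetical.

```python
import numpy as np

def psnr(original, adversarial, peak=255.0):
    """Standard PSNR in dB; the MSE equals ||delta||_2^2 / n for an image of n values."""
    delta = adversarial.astype(np.float64) - original.astype(np.float64)
    mse = np.mean(delta ** 2)
    if mse == 0.0:
        return float("inf")                    # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

def attack_success_rate(predicted, targets):
    """Percentage of adversarial examples classified as their target class."""
    predicted, targets = np.asarray(predicted), np.asarray(targets)
    return float(100.0 * np.mean(predicted == targets))

# Hypothetical example: a 32 x 32 RGB image perturbed by one grey level everywhere.
clean = np.zeros((32, 32, 3), dtype=np.uint8)
perturbed = clean + 1
print(psnr(clean, perturbed))

# Hypothetical predictions for four targeted attacks, three of which hit the target.
print(attack_success_rate([1, 2, 3, 4], [1, 2, 0, 4]))
```

A higher PSNR corresponds to a smaller perturbation, so a strong attack is one that reaches a high ASR while keeping the PSNR high.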
As the parameters of the new attack have the same meaning as those of the C&W attack, we first compare the C&W attack and the new attack under several settings of the input parameters. The results are shown in Tables 1–3, for GTSRB, CIFAR-10, and MNIST, respectively. In all the cases, the confidence parameter $\kappa$ was set to 0. The results obtained by using the C&W attack against the standard one-hot encoding VGG16 network with $M$ classes are also reported in the last column. By looking at the different rows, we can first see that when the strength of the attack is increased, e.g., by using a larger number of iterations or a larger number of steps during the binary search, the ASR of both attacks increases, at the price of a slightly larger distortion. For instance, for CIFAR-10, the ASR of the proposed attack increases from 69.3% to 98.6%, with a decrease in the PSNR of less than 1 dB, and the ASR of the C&W attack increases from 53.6% to 92.6% with an extra distortion of 3 dB. Then, by comparing different columns, we see a clear advantage of the proposed attack over the C&W attack, since the former achieves a higher ASR for the same parameter settings.
By comparing the different tables, we see that the advantage of the new attack is more evident with GTSRB than with CIFAR-10. The use of longer codewords in GTSRB, in fact, makes it harder to attack this classifier; however, the new attack can still achieve an ASR of 93.3% with a PSNR equal to 39 dB.
For the MNIST dataset, the ASR is lower than for CIFAR-10 and GTSRB. This result agrees with the results reported in [15]. One possible explanation of this fact is advanced in [10], where the peculiarities of the MNIST dataset are highlighted and used to argue that high robustness can be easily reached on MNIST.
The comparison with LOTS must be carried out on a different ground, since such an attack is designed in a different way, and the only parameter shared with the new attack is the maximum number of iterations allowed in the gradient descent. For this reason, we applied LOTS by allowing a maximum number of iterations equal to 2000, which is the same number we used for the other two attacks. We verified experimentally that LOTS converges within 1000 iterations 92% of the time (convergence is determined by checking whether the new loss value is close enough to the average loss value of the last 10 iterations), thus validating the adequacy of our choice. Then, we measured the ASR for a given maximum PSNR, which allows us to plot the ASR as a function of the PSNR. The results are shown in Figure 2. Upon inspection of the figure, we observe a behavior similar to that of Tables 1–3. The proposed attack greatly outperforms LOTS and C&W on GTSRB, where longer codewords are used. The ASR of the new attack, in fact, reaches nearly 100% for smaller PSNR values, while LOTS and C&W stop at 42% and 42.3%, respectively. For the other two datasets, the gap between the different attacks is smaller than in the GTSRB case. Specifically, the proposed attack and LOTS perform almost the same on CIFAR-10, while LOTS provides slightly better results on MNIST. This observation can also be verified in Figure 3, where we show some images that are successfully attacked by all the attacks. We can see from the figure that the proposed attack requires less distortion to attack the selected examples, producing images that look visually better than the others. The advantage is particularly evident for the GTSRB case but is still visible for the CIFAR-10 and MNIST images.
As for time complexity, we observe that, though our attack aims at modifying fewer bits each time, its complexity is very similar to that of the C&W attack. Specifically, if we allow 2000 iterations (10 binary search steps) for each attack, for CIFAR-10, the new attack and the C&W attack require about 800 seconds and 1000 seconds to attack an image, respectively (the times are measured using one NVIDIA RTX2080 GPU without parallelization). On the other hand, LOTS is considerably faster since it needs about 80 seconds to attack an image. The reason behind the high computational complexity of C&W and our new attack is the binary search carried out at each step. In fact, we verified that, by reducing the number of steps of the binary search, the speed of both attacks improves greatly. However, since our main purpose is to test the robustness of the ECOC system, we did not invest much effort in optimizing our attack from a computational point of view, all the more so as its complexity is already similar to that of the C&W attack.
As an overall conclusion, the experimental analysis reveals that, in the white-box scenario, the security gain that can be achieved through the ECOC scheme is quite limited, since by properly applying existing attacks, and especially by using the newly proposed attack, the ECOC classifiers can be attacked quite easily.
Another expected advantage of ECOC is that adversarial examples tend to be classified with a lower probability than clean images. Here, we show that such a behavior can be inhibited, at the price of a slightly larger distortion, by increasing the confidence of the attack. If a larger confidence margin is used, in fact, the model becomes more certain about the wrongly predicted class. To back such a claim, in Table 4, we report the results of the new attack for different confidence values for the CIFAR-10 case (to clearly show the effect of confidence, we did not consider adversarial examples that do not reach the chosen confidence margin $\kappa$, which leads to a slight drop of the ASR). The table shows the average probability assigned by the ECOC model to the original class (Prob. true class) and to the target class of the attack (Prob. targ class), before and after the attack.
From the table, we see that, by increasing $\kappa$, the adversarial examples are assigned higher and higher probabilities for the target class, getting closer to those of the benign samples. In particular, the average probability for the target class rises from 0.546, at the lowest confidence value considered, to 0.993, at the highest, which is even higher than the average probability of the clean images before the attack (0.908), and the probability of the original (true) class decreases from 0.194 to a value lower than 0.001 over the same range. A similar behavior can be observed for the C&W attack when $\kappa$ is raised from 0 to 15.
Figure 4 shows the distribution of the probabilities assigned to the most probable class for clean and adversarial images generated by the proposed attack. The plot confirms that the ECOC classifier assigns low probabilities only in the presence of adversarial examples obtained with a low confidence value $\kappa$. When $\kappa$ grows, in fact, the probability distribution of adversarial examples gets closer and closer to that of clean images, and, for a sufficiently large $\kappa$, it becomes impossible to distinguish clean images from adversarial examples by setting a threshold on the probability assigned to the most probable class.
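The detector discussed here amounts to a threshold on the probability of the most probable class. The sketch below shows why such a rule fails against high-confidence adversarial examples; the function name, the threshold of 0.8, and the three probability vectors are hypothetical values chosen for illustration, not measurements from our experiments.

```python
import numpy as np

def flag_low_confidence(probabilities, threshold):
    """Flag inputs whose top-class probability falls below the threshold."""
    top = np.max(probabilities, axis=1)
    return top < threshold

# Hypothetical class-probability vectors, one row per input:
# a clean image, a low-confidence adversarial example, a high-confidence one.
probabilities = np.array([
    [0.95, 0.03, 0.02],
    [0.55, 0.30, 0.15],
    [0.01, 0.98, 0.01],
])

flags = flag_low_confidence(probabilities, threshold=0.8)
print(flags)
```

Only the low-confidence adversarial example is flagged. An example crafted with a large confidence margin has a top-class probability comparable to that of a clean image and passes the check.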
5. Related Works
Adversarial examples, i.e., small, often imperceptible, ad hoc perturbations of the input images, have been shown to be able to easily fool neural networks [1, 2, 5, 24] and have received great attention in the last years.
Different attacks have been proposed to obtain adversarial examples in various ways. Some works focus on diminishing the computational cost necessary to build the adversarial examples [2, 25], while others aim at lowering the perturbation introduced to generate the adversarial examples [1, 6, 26]. There are also some works whose goal is to find adversarial examples that modify only one pixel [27] and adversarial perturbations that can either fool several models at the same time [28], or can be applied to several clean images at the same time [4].
As a response to the threats posed by the existence of adversarial examples and by the ease with which they can be created, many defence mechanisms have also been proposed. According to [29, 30], defences can be roughly categorized into two branches, which work in either a reactive or a proactive manner. The first class of defences is applied after the DNNs have been built. This class includes approaches exploiting randomization, like, for instance, stochastic activation pruning, in which node activations at each layer (or at some layers) are randomly dropped out during the forward propagation pass [31], and, more recently, model switching [32], where random selection is performed between several trained submodels. Other approaches attempt to intentionally modify the network input to mitigate the adversarial perturbation, e.g., by projecting the input into a different space [33] or by applying some input transformations [34]. Yet other approaches attempt to reject input samples that exhibit an outlying behavior with respect to the unperturbed training data [35]. The second branch of defences aims at building more robust DNNs. One simple approach to improve the robustness against adversarial examples is adversarial training, which consists in augmenting the training set with adversarial examples [10–12, 36]. More recently, as more attention has been paid to the role of hidden layers in the robustness of DNNs [37], it has been shown that, rather than augmenting the training set, the robustness of DNNs can be improved by directly injecting adversarial noise into the hidden nodes, thus improving the robustness of single neurons [38, 39].
The ECOC scheme considered in this paper belongs to the second class of defences and is derived from similar attempts made in the general machine learning literature to improve the robustness of multi-class classification [14, 40, 41]. The robustness of ECOC against adversarial examples is assessed in [15] by considering conventional adversarial attacks like [6, 42], which have not been explicitly designed for multi-label classification. As suggested in [17], however, in order to properly assess the effectiveness of a defence mechanism, the case of an informed attacker should be considered, and then the robustness should be evaluated against attacks targeting the specific defence mechanism. Following the spirit of [17], in this paper, we developed a targeted attack against the ECOC system, which exploits the multi-class and multi-label nature of such a system. We observe that the capability of ECOC to hinder the generation of adversarial examples has already been challenged in [17] (Section 10); however, the analysis in [17] is carried out under the more favourable (for the attacker) assumption of a non-targeted attack, thus marking a significant difference with respect to the current work [4, 43].
6. Conclusion
In order to investigate the effectiveness of ECOC-based deep learning architectures in hindering the generation of adversarial examples, we have proposed a new targeted attack explicitly designed to work with such architectures. We measured the validity of the new attack experimentally on three common classification tasks, namely, GTSRB, CIFAR-10, and MNIST. The results show the effectiveness of the new attack and, most importantly, demonstrate that the use of error correction to code the output of a CNN classifier does not significantly increase the robustness against adversarial examples, even in the more challenging case of a targeted attack. In fact, the ECOC scheme can be fooled by introducing a small perturbation into the images, both with the new attack and, to a lesser extent, by applying the C&W and LOTS attacks with a proper setting. No significant advantage in terms of decision confidence is observed either, given that, by properly setting the parameters of the attack, adversarial examples are assigned to the wrong class with a high probability.
Data Availability
The data used to support the findings of this study are available from the first author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was partially supported by the China Scholarship Council (CSC) (file no. 201806960079).