Abstract

Deep neural networks provide good performance in the fields of image recognition, speech recognition, and text recognition. For example, image captioning models use a recurrent neural network to generate text after an image recognition step, thereby providing captions for images. The image captioning model first extracts features from the image and generates a representation vector; it then generates the caption text by using the recurrent neural network. This model has a weakness, however: it is vulnerable to adversarial examples. In this paper, we propose a method for generating restricted adversarial examples that target image captioning models. By adding a minimal amount of noise just to a specific area of an original sample image, the proposed method creates an adversarial example that remains correctly recognizable to humans yet is misinterpreted by the target model. We evaluated the method’s performance through experiments with the MS COCO dataset, using TensorFlow as the machine learning library. The results show that the proposed method generates a restricted adversarial example that is misinterpreted by the target model while minimizing its distortion from the original sample.

1. Introduction

Deep neural networks [1] provide good performance on tasks of image recognition [2, 3], speech recognition [4], text recognition [5], and data generation [6]. Recently, deep neural networks have also demonstrated good performance for image captioning [7], in which text is generated that explains a given image.

However, such image captioning models are vulnerable to adversarial examples [8–13]. An adversarial example is a sample created by adding noise to an original sample in such a way that it is incorrectly classified by the target model and yet remains correctly recognizable to humans. Adversarial examples cause an image recognition model to provide erroneous results. Research on adversarial examples in the context of image captioning models is ongoing.

Previous studies of adversarial examples targeting image captioning models have generated adversarial examples by adding adversarial noise to the entire image. In certain circumstances, however, it may be advantageous to add adversarial noise only to a specific region of an image, for example, by attaching a sticker to it. In a situation in which there is limited opportunity to modify the entire image, it may still be possible to attack by applying a sticker or the like, adding noise just to a specific area of the image.

In this paper, we propose a method for generating restricted adversarial examples targeting image captioning models. The method adds a small amount of noise just to a specific area of an image, creating an adversarial example that is correctly recognized by humans but misclassified by the target model. The contributions of this paper are as follows. First, we propose a method for generating a restricted adversarial example targeting an image captioning model. The underlying principle and the steps of the proposed method are systematically explained. Second, we analyze the attack success rate, distortion, and classification results for adversarial examples generated by the proposed method. Third, we report the performance of the proposed method based on the results of experiments in which the ResNet model [14] was targeted and MS COCO [15] was used as an image dataset.

The remainder of the paper is structured as follows. In Section 2, studies related to the proposed method are reviewed. Section 3 explains the proposed method. Section 4 describes the experiments and presents the results. Section 5 provides further discussion of the proposed method. Finally, Section 6 concludes the paper.

2. Related Work

2.1. Image Captioning Model

A variety of studies [16–18] are being conducted on image-related models that use deep learning technology, such as the image captioning model, whose characteristics we describe here. The image captioning model can describe each object in an image by applying a long short-term memory (LSTM) model [19] and an attention mechanism that can solve the problems of recurrent neural networks (RNNs) [20]. In this model, the input image is encoded in 512 dimensions using a CNN and is then used as the input to the LSTM to generate sentences. The model is defined as

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I, O; \theta), \tag{1}$$

where $\theta$ is the overall parameter of the LSTM model, $I$ is the image, $O$ is a list of objects extracted from the image, and $S$ is the correct answer sentence. The chain rule is applied to process the variable-length sentence $S$ as follows:

$$\log p(S \mid I, O; \theta) = \sum_{t=0}^{N} \log p(S_{t} \mid I, O, S_{0}, \ldots, S_{t-1}; \theta). \tag{2}$$

Equation (2) is optimized over $(I, O, S)$ pairs during training. The model uses a CNN to represent images; the CNN is currently the model most widely used in image processing and object recognition problems. YOLO9000 [21] is used to extract the object recognition information.
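To make this structure concrete, the following is a minimal TensorFlow/Keras sketch of a CNN-encoder/LSTM-decoder captioner of the kind described above. The vocabulary size, caption length, input resolution, and the untrained ResNet50 feature extractor are illustrative assumptions rather than the authors' configuration (the experiments in Section 4 use ResNet 101 and a trained encoder).

```python
# Minimal sketch of a CNN-encoder / LSTM-decoder captioning model.
# Sizes and the feature extractor are illustrative assumptions.
import tensorflow as tf

VOCAB_SIZE = 10000   # assumed vocabulary size
EMBED_DIM = 512      # embedding dimension (Section 2.1 encodes the image in 512 dimensions)
MAX_LEN = 20         # assumed maximum caption length

# Encoder: a CNN maps the image to a single feature vector, projected into
# the 512-dimensional embedding space (x_{-1} = CNN(I)).
cnn = tf.keras.applications.ResNet50(include_top=False, pooling="avg", weights=None)
image_in = tf.keras.Input(shape=(224, 224, 3))
image_feat = tf.keras.layers.Dense(EMBED_DIM)(cnn(image_in))

# Decoder: word embeddings feed an LSTM whose output at step t is the
# distribution over the next word.
words_in = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
word_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(words_in)

# The image feature is fed only once, as the first element of the sequence.
image_step = tf.keras.layers.Reshape((1, EMBED_DIM))(image_feat)
seq = tf.keras.layers.Concatenate(axis=1)([image_step, word_emb])
lstm_out = tf.keras.layers.LSTM(512, return_sequences=True)(seq)
next_word = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(lstm_out)

captioner = tf.keras.Model([image_in, words_in], next_word)
captioner.summary()
```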

For training, the LSTM model computes each word from the preceding word and the image representation generated by the CNN. After the model is trained on the words, $m_{t}$, the output of the LSTM at step $t$, is used as an input to the LSTM at step $t+1$:

$$x_{-1} = \mathrm{CNN}(I), \qquad x_{t} = W_{e} S_{t}, \qquad p_{t+1} = \mathrm{LSTM}(x_{t}), \quad t \in \{0, \ldots, N-1\}, \tag{3}$$

where each word $S_{t}$ is expressed as a one-hot vector and $S_{0}$ is a special character that marks the beginning of the sentence. In this equation, the image information generated by the CNN and the words expressed by the word embedding $W_{e}$ are mapped to the same space. The image is entered only once, at $t = -1$. The sentence created in this manner is subjected to attention with the words generated by object extraction:

$$\mathrm{sim}(v_{o}, v_{s}) = \frac{v_{o} \cdot v_{s}}{\lVert v_{o} \rVert \, \lVert v_{s} \rVert}, \tag{4}$$

where $v_{o}$ is the multi-hot vector of the object-extracted nouns and $v_{s}$ is the multi-hot vector of the nouns in the generated sentence.

During training, words are generated by object extraction in various ways. Nouns are extracted from the correct answer sentence, and these are assumed to be the result of object extraction. Attention is therefore applied to these extracted nouns and to the nouns in the output sentence of the LSTM. A multi-hot vector is generated from the object-extracted nouns and from the nouns in the LSTM output sentence; that is, the length of the vector equals the dictionary size, and each entry is 1 if the corresponding noun is present and 0 otherwise. The cosine similarity between the multi-hot vector of the object-extracted nouns and the multi-hot vector of the nouns in the LSTM output sentence is calculated and applied as a loss term. Comparing multi-hot vectors through cosine similarity [22] has the same effect as attention: more weight is given to objects that appear both in the generated sentence and in the correct caption. The loss function is expressed as the sum of the negative log likelihoods of the correct words at each step:

$$L(I, O, S) = -\sum_{t=1}^{N} \log p_{t}(S_{t}). \tag{5}$$
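The following short sketch illustrates the multi-hot encoding and cosine similarity computation described above, using a toy dictionary and toy noun sets rather than the actual vocabulary.

```python
# Sketch of the multi-hot / cosine-similarity idea; the vocabulary and
# noun sets are toy examples for illustration only.
import numpy as np

vocab = ["dog", "ball", "park", "man", "car", "tree"]   # assumed dictionary

def multi_hot(nouns, vocab):
    """Vector of dictionary length: 1 if the noun occurs, 0 otherwise."""
    return np.array([1.0 if w in nouns else 0.0 for w in vocab])

detected = multi_hot({"dog", "ball"}, vocab)    # nouns from object extraction
generated = multi_hot({"dog", "park"}, vocab)   # nouns in the LSTM output sentence

cosine = detected @ generated / (
    np.linalg.norm(detected) * np.linalg.norm(generated) + 1e-8)
print(f"cosine similarity = {cosine:.3f}")      # higher when the noun sets overlap
```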

The loss function given by Equation (5) is minimized with respect to all parameters of the LSTM by inputting the image information produced by the CNN, the word embedding information, and the object-extracted word information.

2.2. Adversarial Examples

An adversarial example, first proposed by Szegedy et al. [8], is a sample created by adding a minimal amount of noise to an original sample in such a way that it is recognized as the original class by humans but is incorrectly classified by the target model.

Three distance metrics are commonly used to measure the noise added to create an adversarial example: $L_{0}$ [23], $L_{2}$ [24], and $L_{\infty}$ [25]. In all three metrics, the smaller the value, the more closely the adversarial example resembles the original sample. The $L_{2}$ metric is the one used in this study.
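As a small illustration, the snippet below computes the three distance metrics between a stand-in original image and a locally perturbed copy; the image and noise values are random placeholders.

```python
# Sketch: L0, L2, and L-infinity distances between an original image x
# and an adversarial image x_adv (pixel values assumed in [0, 1]).
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((224, 224, 3))                                      # stand-in image
x_adv = x.copy()
x_adv[200:, 200:, :] += 0.05 * rng.standard_normal((24, 24, 3))    # local noise
x_adv = np.clip(x_adv, 0.0, 1.0)

delta = (x_adv - x).ravel()
l0 = np.count_nonzero(delta)        # L0: number of components that changed
l2 = np.linalg.norm(delta)          # L2: Euclidean magnitude of the change
linf = np.abs(delta).max()          # L-infinity: largest single change
print(l0, l2, linf)
```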

Methods for generating adversarial examples for use in a white-box attack include the fast gradient sign method (FGSM) [26], DeepFool [27], and the Carlini and Wagner (CW) method [28]. These methods compute the gradient of the model's output with respect to the input and then perturb the input along that gradient so that the probability of a chosen target class increases, searching for an optimal adversarial example. White-box attacks have a success rate of nearly 100% because all information about the target model is available to the attacker. The CW method, in particular, explicitly controls the trade-off between attack success rate and distortion, producing adversarial examples with a high success rate and minimal distortion.

Much of the research, however, focuses on black-box rather than white-box attacks. Black-box attack methods include the transfer attack, universal perturbation, substitute network, and decision boundary methods. The transfer attack [29] exploits the fact that an adversarial example generated against one model is often effective against other models as well; it achieves a high attack success rate in the image domain. The universal perturbation method [30] adds a single precomputed perturbation to an arbitrary original sample so that the model misclassifies it; this method applies a relatively large amount of distortion, which typically causes any input sample to be misclassified. The substitute network method [31] builds a model similar to the target model by issuing multiple queries to the black-box model; it can then attack the black-box model with an adversarial example that is misclassified by the substitute. The decision boundary method [32] can be used when only the output produced by the black-box model for a given input is known. It applies distortion to the image just until it ceases to be recognized as the original class while maintaining the similarity between the adversarial image and the original image. It is not a mathematically optimal method and does not guarantee minimal noise, but it can generate adversarial examples relatively easily. HopSkipJumpAttack [33] is an improved version of the decision boundary method with proven mathematical convergence: after locating a point on the decision boundary by binary search, it perturbs that point by a small step (epsilon) in the direction perpendicular to the boundary and then performs another binary search toward the original sample to find a new boundary point. Repeating this process several times yields an adversarial example that is misclassified yet only minimally distorted from the original sample. The method proposed in this paper generates a restricted adversarial example that is misclassified by the target model under the assumption of a white-box attack.
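Because FGSM [26] serves as the baseline attack in Section 4, a minimal sketch of its one-step update is given below. The classifier, input, and label are stand-ins; attacking the captioning model itself would require backpropagating through its caption loss instead.

```python
# Sketch of the FGSM update for a generic differentiable classifier.
# The model and label below are placeholders, not the captioning model.
import tensorflow as tf

def fgsm(model, x, label, eps=0.01):
    """One-step FGSM: shift every pixel by eps in the sign of the loss gradient."""
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(label, model(x))
    grad = tape.gradient(loss, x)
    x_adv = x + eps * tf.sign(grad)
    return tf.clip_by_value(x_adv, 0.0, 1.0)

# Toy usage with a small, untrained stand-in classifier.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
x = tf.random.uniform((1, 32, 32, 3))
x_adv = fgsm(model, x, tf.constant([3]))
```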

3. Proposed Scheme

3.1. Overview

The proposed method assumes a white-box scenario, in which information about the target model is available. Specifically, when an adversarial example is being generated, the target model provides the probability values of its outputs for a given input. Using this information, adversarial noise is added to an original image to increase the probability that it is incorrectly classified by the target model.

In the proposed method, the loss is divided into a distortion loss and an attack loss, unlike the loss in existing methods. Whereas conventional methods create the adversarial example by adding adversarial noise to all pixels of the original sample, the proposed method adds noise only to pixels in a specific area of the original sample. This approach makes it possible to induce misclassification by attaching a small sticker to a specific area of the original sample. If the noise is added to an inconspicuous area, it is also less easily discerned, which is an advantage in terms of perceptibility to humans. In summary, the proposed method differs from other methods in that it adds adversarial noise only to a specific area, and it has the advantage of being able to execute an attack simply by placing a sticker on a sample.
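The restriction can be expressed as a binary mask that zeroes the noise everywhere except the chosen region. The sketch below assumes a 224×224 image and the lower-right 1/16 area used in Section 4; the image and noise values are random placeholders.

```python
# Sketch: confining adversarial noise to a restricted region with a binary mask.
import numpy as np

H, W = 224, 224
mask = np.zeros((H, W, 1), dtype=np.float32)
mask[3 * H // 4:, 3 * W // 4:, :] = 1.0       # lower-right 1/16 of the image area

rng = np.random.default_rng(0)
x = rng.random((H, W, 3)).astype(np.float32)                    # stand-in image
noise = rng.uniform(-0.05, 0.05, (H, W, 3)).astype(np.float32)  # candidate noise
x_adv = np.clip(x + mask * noise, 0.0, 1.0)   # noise is zero outside the mask
```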

3.2. Description of the Method

The proposed method generates a restricted adversarial example that is perceived as normal by humans but is incorrectly classified by the model. Figure 1 provides an overview of the method. As shown in the figure, a transformer receives the original sample and the original class as inputs and generates a transformed example by adding a small amount of noise to a specific part of the original sample. The transformer provides the transformed example to the target model and receives feedback on it. Through the feedback, the transformer obtains the loss function value and updates the transformed example. By repeating this process, the method adds a minimum amount of noise to a specific part of the original sample, thereby generating a restricted adversarial example that is perceived as normal by humans but is incorrectly classified by the target model.

The purpose of the proposed method is to create a restricted adversarial example. For this study, the transformer presented in [28, 34] was modified as follows:

$$x^{*} = x + M \odot \delta, \tag{6}$$

where $\delta$ is a specific noise and $M$ is a binary mask so that $\delta$ modifies only the pixels in the specific area chosen by the attacker; pixels in areas other than this restricted area are set to a fixed value of zero. The captioning model $f$ accepts $x^{*}$ as the input value and provides the loss result to the transformer. At each iteration, the transformer repeats the above procedure to generate a restricted adversarial example while minimizing the total loss $L_{\mathrm{total}}$, which is defined as

$$L_{\mathrm{total}} = L_{\mathrm{distortion}} + c \cdot L_{\mathrm{attack}}, \tag{7}$$

where $L_{\mathrm{distortion}}$ is the distortion component of the loss function, $L_{\mathrm{attack}}$ is the classification loss function of $f$, and $c$ is the weight value for the model, with an initial value of 1. $L_{\mathrm{distortion}}$ is the distortion distance between the transformed example $x^{*}$ and the original sample $x$:

$$L_{\mathrm{distortion}} = D(x, x^{*}). \tag{8}$$

The attack loss $L_{\mathrm{attack}}$ should be minimized:

$$L_{\mathrm{attack}} = \max\bigl\{ Z(x^{*})_{\mathrm{org}} - \max_{i \neq \mathrm{org}} Z(x^{*})_{i},\; 0 \bigr\}, \tag{9}$$

where $\mathrm{org}$ is the original class and $Z(\cdot)$ [28, 35] gives the probabilities for the class predictions of the image captioning model $f$. The transformer causes $f$ to predict the probability of an incorrect class to be higher than that of the original class by optimally minimizing $L_{\mathrm{attack}}$. Some discrete pixels can also be selected if the restricted pixel region is represented as a list of coordinates of the form $[[x_{1}, y_{1}], [x_{2}, y_{2}], [x_{3}, y_{3}]]$.
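The following sketch shows one possible implementation of this optimization loop with Adam, using the learning rate (0.005) and iteration count (1000) reported in Section 4.1. A small stand-in classifier replaces the captioning model, and the margin-style attack loss only approximates Equation (9), so this is an illustration of the procedure rather than the authors' implementation.

```python
# Sketch of the restricted adversarial optimisation: minimise distortion plus a
# weighted attack loss with Adam, updating only the masked (lower-right) region.
import numpy as np
import tensorflow as tf

H = W = 64
target_model = tf.keras.Sequential([        # stand-in for the captioning model
    tf.keras.Input(shape=(H, W, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])

x = tf.random.uniform((1, H, W, 3))         # original sample (placeholder)
orig_class = 3                              # assumed original class index

mask_np = np.zeros((1, H, W, 3), dtype=np.float32)
mask_np[:, 3 * H // 4:, 3 * W // 4:, :] = 1.0        # lower-right 1/16 area
mask = tf.constant(mask_np)

delta = tf.Variable(tf.zeros_like(x))       # restricted noise to optimise
opt = tf.keras.optimizers.Adam(learning_rate=0.005)
c = 1.0                                     # weight on the attack loss

for step in range(1000):
    with tf.GradientTape() as tape:
        x_adv = tf.clip_by_value(x + mask * delta, 0.0, 1.0)
        scores = target_model(x_adv)[0]
        distortion = tf.reduce_sum(tf.square(x_adv - x))        # D(x, x*)
        # Untargeted margin loss: push the original class below the best other class.
        other = tf.reduce_max(
            tf.concat([scores[:orig_class], scores[orig_class + 1:]], axis=0))
        attack = tf.maximum(scores[orig_class] - other, 0.0)
        loss = distortion + c * attack
    grads = tape.gradient(loss, [delta])
    opt.apply_gradients(zip(grads, [delta]))
```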

4. Experimental Setup and Results

In this section, we present the performance analysis of the adversarial examples generated by the proposed method for the image captioning model. The TensorFlow [36] machine learning library was used as the experimental environment, and an Intel(R) i5-7100 3.90 GHz computer was used as the server.

4.1. Experimental Setup

MS COCO [15] was used as the dataset for the experiment. This dataset was created for the purpose of performing computer vision tasks such as object detection, segmentation, and keypoint detection. It has 80 object classes, more than 1.5 million object instances, and 164,062 image samples. Of the 164,062 image samples, we used 82,783 for training, 40,504 for validation, and 40,775 for testing.

A CNN–RNN model [7] was used as the target model for the experiment. Image feature extraction was performed using the CNN model, and caption generation was performed using the RNN model. The CNN was ResNet 101 (Table 1), and the RNN was an LSTM with an embedding size of 256 and a hidden state of 512 dimensions. The learning rate was 0.001, and the number of epochs was 5.

In generating the restricted adversarial example, Adam [37] was used as the optimization algorithm. Each restricted adversarial example was generated by adding a minimal amount of noise to the lower right part of an original sample, corresponding to 1/16 of its total area. The learning rate was set to 0.005, and the number of repetitions was set to 1000. The performance was analyzed on 1000 adversarial examples generated in this manner from randomly selected original samples.

4.2. Experimental Results

Table 2 shows examples of original image samples and their corresponding adversarial examples generated by the proposed method. It can be seen that the proposed adversarial examples are nearly identical to their corresponding original samples. This is because they are created by applying a minimal amount of distortion to the original sample and are designed to be correctly recognized by humans but incorrectly classified by the target model.

Table 3 shows the images and their top three captioning results for an original sample, the baseline adversarial example, and the proposed adversarial example. The baseline adversarial example was generated by applying the fast gradient sign method (FGSM) as the baseline method. It can be seen that the baseline adversarial example was generated by adding noise throughout the original sample. The restricted adversarial example was generated by adding a minimal amount of noise to the lower right part of the original sample, corresponding to 1/16 of its total area. As can be seen, the restricted adversarial example is nearly identical to the original sample according to human perception. However, the caption interpretations for the original sample, the baseline adversarial example, and the proposed adversarial example differed. The target model correctly captions the original sample to fit the image but misinterprets the proposed adversarial example and captions it inappropriately.

Table 4 shows a comparison of BLEU scores for the original sample, the baseline adversarial example, and the proposed adversarial example. Here again, the baseline adversarial example was generated using FGSM. The BLEU score is an evaluation index for machine translations. As shown in the following formula, the unigram, bigram, up to $n$-gram precisions of a machine-translated sentence are computed against the reference (correctly translated) sentence, their geometric mean is taken, and then a brevity penalty is applied if the sentence is too short:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_{n} \log p_{n} \right), \qquad \mathrm{BP} = \min\!\left(1,\; e^{\,1 - r/c}\right), \tag{10}$$

where $p_{n}$ is the $n$-gram precision, $w_{n}$ is the weight for each $n$-gram order (typically $1/N$), $r$ is the length of the reference sentence, and $c$ is the length of the candidate sentence.
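For reference, BLEU can be computed with NLTK as in the sketch below; the reference and candidate captions are toy examples rather than the experimental outputs, and smoothing is added to avoid zero n-gram counts on short sentences.

```python
# Sketch: BLEU score of a generated caption against a reference caption.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "dog", "plays", "with", "a", "ball", "in", "the", "park"]]
candidate = ["a", "dog", "runs", "in", "the", "park"]

score = sentence_bleu(
    reference, candidate,
    weights=(0.25, 0.25, 0.25, 0.25),                 # uniform weights up to 4-grams
    smoothing_function=SmoothingFunction().method1)   # avoids zero n-gram counts
print(f"BLEU = {score:.3f}")
```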

It can be seen in Table 4 that the BLEU score of the original sample is higher than those of the baseline adversarial example and the proposed adversarial example. This is because the original sample is correctly recognized and is given a caption with high accuracy, whereas the proposed adversarial example and the baseline adversarial example are misinterpreted and given incorrect captions that differ from that of the original sample.

The restricted area can be positioned anywhere on the image. The reason we located it at the lower right is that it is easy for an attacker to apply the attack by attaching a sticker to a corner of the image. If the proposed method is applied to an area other than the lower right, however, its performance remains the same. Table 5 shows possible positions for the restricted area: top left, top right, bottom right, bottom left, and center. It can be seen in the table that the proposed adversarial example is misinterpreted regardless of the position of the restricted area.

Figure 2 shows the attack success rate of the restricted adversarial example according to the size of the restricted area. From the figure, it can be seen that the attack success rate increases as the size of the restricted area increases. When the size of the restricted area is 1/16 of the total image area, the restricted adversarial example has an attack success rate of 100%.

5. Discussion

The proposed method creates a restricted adversarial example that is misclassified by the target model but poses no problem for human recognition. The target model generates a representation vector by extracting features from the original image and then generates the caption word by word through the use of a recurrent neural network. The proposed method generates an untargeted adversarial example that causes the target model to misinterpret it and thus produce an arbitrary caption instead of the original caption. The experiments show that certain words in the caption provided by the target model for the proposed adversarial example differ from those provided for the original sample.

In addition, the correlation between the interpreted caption and the original caption was examined using the BLEU score as an evaluation index, and it was found that the BLEU score of the caption for the proposed adversarial example was lower than that for the original sample. This demonstrates that the proposed adversarial example is misclassified and given an arbitrary caption that is different from the caption for the original image.

The proposed adversarial example is seen to be similar to the original sample in terms of human perception; this is because it is created by applying a minimal amount of distortion to the original sample. If the proposed method is used in a military scenario to generate an adversarial example by adding the optimal amount of noise to a particular image, the modified image can be misinterpreted by the enemy’s recognition model. In the healthcare field, patient CT images can be used to generate an adversarial example using the proposed method, leading to misinterpretation. Therefore, an adversarial example generated by the proposed method would pose a serious threat because of the vulnerabilities of the image captioning model.

We applied the proposed method to a second dataset, Flickr 8K [38]. Table 6 shows example images and captions for an original sample from the Flickr 8K dataset and the corresponding proposed adversarial example.

As can be seen, the proposed adversarial example is misinterpreted and given a different caption from that for the original sample. This demonstrates that the method is applicable to the Flickr 8K dataset as well as to the MS COCO dataset.

6. Conclusion

In this paper, we have proposed a method for generating restricted adversarial examples for image captioning models. This method adds noise just to a specific area of the entire image, creating an adversarial example that is correctly recognized by humans but is misclassified by the target model. The experimental results demonstrate that the proposed method generates an adversarial example that is similar to the original sample in terms of human perception and yet is misclassified by the target model.

In future studies, this research can be extended to other image datasets and applied to the voice and text domains. In addition, the adversarial example could be generated using a generative adversarial network [39]. Finally, it would be interesting to investigate methods of defense against the proposed attack.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request after acceptance.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the AI R&D Center of Korea Military Academy, the Hwarang-Dae Research Institute of Korea Military Academy, and the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2020R1C1C1A01005229, NRF-2021R1A4A5032622, and 2021R1I1A1A01040308).