Abstract

Deep neural networks provide good performance for image recognition, speech recognition, text recognition, and pattern recognition. However, such networks are vulnerable to attack by adversarial examples. Adversarial examples are created by adding a small amount of noise to an original sample in such a way that humans perceive no problem, yet the sample is incorrectly recognized by a model. Adversarial examples have been studied mainly in the context of images, but research has expanded to include the text domain. In the textual context, an adversarial example is a sample of text in which certain important words have been changed so that the sample is misclassified by a model even though, to humans, it retains the meaning and grammar of the original text. In the text domain, there have been relatively few studies on defenses against adversarial examples compared with studies on adversarial example attacks. In this paper, we propose an adversarial training method to defend against adversarial examples that target the latest text model, bidirectional encoder representations from transformers (BERT). In the proposed method, adversarial examples are generated using various parameters and then applied in additional training of the target model to instill robustness against unknown adversarial examples. Experiments were conducted using five datasets (AG’s News, a movie review dataset, the IMDB Large Movie Review Dataset (IMDB), the Stanford Natural Language Inference (SNLI) corpus, and the Multi-Genre Natural Language Inference (MultiNLI) corpus), with TensorFlow as the machine learning library. According to the experimental results, the baseline model had an average accuracy of 88.1% on the original sentences and an average accuracy of 9.2% on the adversarial sentences, whereas the model trained with the proposed method maintained an average accuracy of 87.2% on the original sentences and achieved an average accuracy of 22.5% on the adversarial sentences.

1. Introduction

Deep neural networks [1] provide good performance for artificial intelligence services such as image recognition [2], speech recognition [3], and pattern analysis [4]. However, such networks are vulnerable to attack by adversarial examples [5–7]. Adversarial examples are samples created by adding a minimal amount of noise to an original sample in such a way that humans perceive no problem but the model misrecognizes the sample. Adversarial examples pose a serious threat in areas such as autonomous vehicles and medical services.

Most research on adversarial examples has been conducted in the context of images [8], but recently, research has also been conducted on adversarial examples in the field of text [6, 9]. In the image domain, an adversarial example that humans recognize normally is created by adding a small amount of noise to an original sample in such a way that the model’s classification score for the class targeted for erroneous recognition becomes high. In the text domain, by contrast, adversarial examples designed to cause misrecognition by a target model are created word by word. This approach extracts candidate important words from a sentence, selects the candidate with the highest probability of causing misrecognition, and generates a corresponding adversarial example whose grammar and meaning, as perceived by humans, remain unchanged but that is misrecognized by the model. Methods of generating such textual adversarial examples are an active area of research.

Studies are also being conducted on defenses against adversarial examples, such as the adversarial training method [10, 11], in which the model is given additional training on adversarial examples that are already known. Such defense methods have been introduced in the image and voice domains, but few have been proposed for text. Moreover, the latest attacks generate adversarial examples targeting the BERT model [12] rather than the LSTM model [13], and a robust defense method that trains the model on various types of adversarial examples generated using various parameters has not yet been introduced for this latest model.

In this paper, we propose an adversarial training method for BERT, a text-based model. The proposed method is designed to make the target model robust to unknown adversarial examples by generating various adversarial examples and using them to train the target model. Unlike the existing method, the proposed method trains the target model on adversarial examples generated by adjusting several parameters. We evaluated the method experimentally on five datasets: three classification datasets and two entailment datasets. The contributions of this study are as follows. First, we developed an adversarial training method that generates a variety of adversarial examples and is designed to make the BERT model robust against adversarial examples. Unlike the existing method, the proposed method targets the latest BERT model and, to increase the model’s robustness, manipulates multiple parameters to generate various types of adversarial examples for training. In this paper, we systematically present the principle behind the method as well as the components of its operation. Second, we compared the model’s accuracy on the original samples with its accuracy on the adversarial examples for five datasets and analyzed the attack success rate of the adversarial examples. In addition, we evaluated how well the model’s accuracy on the original samples was preserved. Third, to evaluate the performance of the proposed method, we applied it to five datasets (AG’s News [14], a movie review dataset [15], the IMDB Large Movie Review Dataset [16], the Stanford Natural Language Inference (SNLI) corpus [17], and the Multi-Genre Natural Language Inference (MultiNLI) corpus [18]), which include both classification and entailment datasets.

The remainder of this paper is organized as follows. In Section 2, studies related to the proposed method are summarized, and in Section 3, the proposed method is described. Section 4 explains the experiment and presents the evaluation of its results, and Section 5 discusses the proposed method. Finally, Section 6 concludes the paper.

2. Related Work

This section describes the BERT model and provides a brief overview of adversarial examples.

2.1. BERT Model

The internal operation of the BERT model [12] is as follows. The input embedding of the BERT model is the sum of a token embedding, a segment embedding, and a position embedding. First, for the token embedding, WordPiece embedding is applied: starting from character-level units, frequently occurring subwords are merged into single units. In other words, a subword that appears frequently becomes a unit of its own, whereas a word that does not appear frequently is split into subwords. Second, the segment embedding distinguishes two sentences that are combined using a sentence separator token (SEP). Owing to the limitation on the input length, the two sentences together are limited to 512 subwords or fewer. Because the training time increases with the square of the input length, an appropriate input length must be set. Third, the position embedding provides location information. BERT is based on the transformer, which uses self-attention instead of the CNN and RNN structures of earlier models; because self-attention cannot take the position of the input into account, location information for each input token must be supplied. The transformer uses positional encoding based on the sinusoid function, whereas BERT replaces it with a learned position embedding that simply encodes the token order, such as 0, 1, 2, 3. BERT sums the above three embeddings, applies layer normalization and dropout, and uses the result as the input.
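The following minimal sketch, written with the TensorFlow library used later in this paper, illustrates how the three embeddings can be summed and then passed through layer normalization and dropout. The layer sizes follow the BERT-base configuration described in Section 4.2; the class itself is an illustration rather than the official BERT implementation.

```python
import tensorflow as tf

# Minimal sketch of BERT's input embedding: token + segment + position
# embeddings are summed, then layer-normalized and passed through dropout.
# Sizes follow the BERT-base configuration; this is illustrative only.
class BertInputEmbedding(tf.keras.layers.Layer):
    def __init__(self, vocab_size=30522, max_position=512,
                 hidden_size=768, dropout_rate=0.1):
        super().__init__()
        self.token_emb = tf.keras.layers.Embedding(vocab_size, hidden_size)
        self.segment_emb = tf.keras.layers.Embedding(2, hidden_size)
        self.position_emb = tf.keras.layers.Embedding(max_position, hidden_size)
        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12)
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, token_ids, segment_ids, training=False):
        seq_len = tf.shape(token_ids)[1]
        positions = tf.range(seq_len)                      # 0, 1, 2, 3, ...
        x = (self.token_emb(token_ids)
             + self.segment_emb(segment_ids)
             + self.position_emb(positions))
        return self.dropout(self.layer_norm(x), training=training)
```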

Pretraining of the BERT model is based on the transformer. The structure used is the transformer encoder, with self-attention applied. BERT uses the masked language model (MLM) [19] and next sentence prediction (NSP) [20] objectives to learn the characteristics of the language well. MLM masks tokens at random in the input sentence and learns by predicting the original tokens at the masked positions. NSP, given two sentences, predicts whether the second sentence actually follows the first. For fine-tuning on natural language inference (NLI) and question answering (QA), the relationship between two sentences must be considered. BERT’s performance is also improved by the use of transfer learning.
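As an illustration of the MLM objective, the following sketch masks a fraction of the token ids and keeps the original ids as prediction targets. The 15% masking rate and the special-token handling of the original BERT recipe are simplified here, and the [MASK] id of 103 is the value used in the standard BERT-base vocabulary.

```python
import numpy as np

# Illustrative sketch of MLM masking: some input tokens are replaced by the
# [MASK] id, and the model is trained to predict the original ids at those
# positions. Simplified relative to the original BERT recipe.
def mask_tokens(token_ids, mask_prob=0.15, mask_id=103, seed=0):
    rng = np.random.default_rng(seed)
    token_ids = np.array(token_ids)
    labels = np.full_like(token_ids, -100)   # -100 marks positions that are not predicted
    mask = rng.random(token_ids.shape) < mask_prob
    labels[mask] = token_ids[mask]           # targets are the original token ids
    token_ids[mask] = mask_id                # the model sees [MASK] at these positions
    return token_ids, labels
```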

2.2. Adversarial Examples

The adversarial example was proposed by Szegedy et al. [5]. Adversarial examples can be classified according to the target of the attack, information available about the model, and the method of creation.

2.2.1. Target of Attack

Adversarial examples can be classified into targeted adversarial examples [7, 21] and untargeted adversarial examples [22] according to whether or not a specific class is targeted by the attack for erroneous recognition by the model. The targeted adversarial example is a sample that is intended to be misrecognized by the model as a specific class targeted by the attacker. The untargeted adversarial example is a sample that is designed to be misrecognized by the model as any class other than the original (valid) one. Compared with targeted adversarial examples, untargeted adversarial examples can be generated with less distortion and fewer iterations.

2.2.2. Information about the Target Model

Adversarial examples can also be classified, according to the information available about the model, into white box attacks [23, 24] and black box attacks [25, 26]. A white box attack is one in which the attacker has all the information about the model: the structure and parameters of the model are known, as well as the probability values returned for an input. A black box attack is one in which only the model’s output for an input is known, with no other information available about the model. In this paper, it is assumed that the probability values returned by the model for an input are known to the attacker.

2.2.3. Generation Methods

In the image domain, the generation of adversarial examples is continuous in the sense that pixel values can be perturbed by small amounts across the entire image, whereas in the text domain the process is discrete. For text, the generation method finds the word whose replacement gives the highest probability of causing the input to be misrecognized as the target class and replaces it word by word. Unlike images, adversarial examples in the text domain must have no grammatical or semantic problems perceptible to humans, so analysis of the candidate words is required. Several studies have been conducted on the generation of adversarial examples in the text domain. Zhao et al. [27] proposed the generation of adversarial examples using generative adversarial nets. This method can create adversarial examples in the image or text domain and is designed to attack the LSTM model, a model that recognizes text. In their study, an adversarial example similar to the original sample was created that was not abnormal in grammar or meaning and that caused misrecognition by the model. Kim et al. [28] proposed the generation of adversarial examples for datasets in various languages, including English, German, Spanish, French, and Russian. Their method generates adversarial examples for Char-CNN and LSTM models. Jin et al. [29] proposed a method of generating adversarial examples for the BERT model. In their study, performance was evaluated on five classification task datasets and two textual entailment tasks. An adversarial example was created that was similar to the original sample in grammar and meaning yet was misrecognized by the model.

3. Proposed Scheme

The adversarial training method consists of two steps: the creation of text adversarial examples and the use of these samples in training the model, as shown in Figure 1. As the target model undergoes additional training on adversarial examples randomly generated against it, the target model gains robustness against attacks by unknown adversarial examples. In the first step, the generation of an adversarial example, important words in the original sentence whose replacement poses no grammatical problems are identified for the target model. Then, candidate words for replacing each important word are generated, and these are substituted in turn to find the word with the highest probability of causing misrecognition by the target model. The adversarial example is generated by substituting the candidate word with the highest probability of misrecognition. Adversarial examples created in this way do not seem abnormal to humans in terms of grammar and meaning, yet they are misrecognized by the model. In the second step, the target model undergoes additional training on the adversarial examples generated in this way, learning to classify them into the correct class. With this additional training, the target model becomes robust against unknown adversarial examples.
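The overall procedure can be summarized by the following sketch, in which generate_adversarial_example stands for the word-substitution attack described below and train_step for one gradient update of the fine-tuned BERT classifier; both names are placeholders rather than the authors’ implementation.

```python
# High-level sketch of the two-step procedure in Figure 1. The attack and
# the model update are passed in as callables so the loop stays generic.
def adversarial_training(train_step, generate_adversarial_example, train_pairs, epochs=1):
    for _ in range(epochs):
        for sentence, label in train_pairs:
            # Step 1: craft an adversarial sentence against the current model.
            adv_sentence = generate_adversarial_example(sentence, label)
            # Step 2: additionally train on the adversarial sentence, assigning
            # it the original (correct) label, alongside the original sentence.
            train_step(sentence, label)
            train_step(adv_sentence, label)
```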

The mathematical expression of the proposed method is as follows. Let f be the operation function of the target model. In the proposed method, an adversarial example x* that will be misrecognized by f as the target class y* is generated from a random original sentence x for the target model.

To generate these adversarial examples in the text domain, the following procedure is used. First, given a sentence X consisting of the words w1, w2, ..., wn, the word importance ranking (WIR) step aims to find the keywords that have the greatest influence on the prediction of the target model f. Therefore, the words whose change has the greatest influence on the final prediction outcome are selected. In addition, semantic similarity should be kept high, and the changes made during the selection process should be kept to a minimum. After a word wi is deleted from X, a confidence value is calculated by comparing the predicted score before the deletion with the result returned by the model f for the reduced sentence; the importance score of wi is this change in prediction before and after the word change. When words are ranked by importance, stop words such as “that” and “an” are filtered out so that they do not interfere with the grammar of the sentence.
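A minimal sketch of this ranking step is shown below; predict_proba is a placeholder callable that returns the model’s class probabilities for a piece of text, and the stop-word list is only illustrative.

```python
# Sketch of word importance ranking (WIR): each word is deleted in turn and
# the drop in the model's confidence for the true class is that word's
# importance score. Stop words are skipped, as noted above.
STOP_WORDS = {"a", "an", "the", "that", "of", "and"}  # illustrative stop-word list

def rank_words_by_importance(predict_proba, words, true_class):
    base_score = predict_proba(" ".join(words))[true_class]
    scores = []
    for i, word in enumerate(words):
        if word.lower() in STOP_WORDS:
            continue
        reduced = " ".join(words[:i] + words[i + 1:])    # sentence with word i removed
        drop = base_score - predict_proba(reduced)[true_class]
        scores.append((drop, i, word))
    return sorted(scores, reverse=True)                   # most influential words first
```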

Second, given a word of high importance, the word converter applies a substitution mechanism to that word. Finding the most suitable word substitution requires three steps: synonym extraction (SE), a part-of-speech (POS) check, and a semantic similarity check (SSC). In addition, the substitution must satisfy the condition that the resulting adversarial example is misclassified by the model.

In SE, the set of all possible alternative candidates for the selected word w is collected. The candidates are the N synonyms having the closest cosine similarity to w, where words are represented by word embeddings. The embedding vectors are used to identify synonyms with cosine similarities greater than a threshold δ. In this study, N was set from 20 to 50 and δ from 0.5 to 0.7 to control diversity and semantic similarity, respectively.
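The following sketch illustrates the SE step under the assumption that a word-embedding dictionary (word to vector) has been loaded; n_candidates and delta correspond to N and δ above.

```python
import numpy as np

# Sketch of synonym extraction (SE): keep the N nearest neighbours of the
# selected word in the embedding space whose cosine similarity exceeds the
# threshold delta. `embeddings` is assumed to map words to vectors.
def extract_synonyms(word, embeddings, n_candidates=50, delta=0.7):
    v = embeddings[word] / np.linalg.norm(embeddings[word])
    scored = []
    for candidate, u in embeddings.items():
        if candidate == word:
            continue
        sim = float(np.dot(v, u / np.linalg.norm(u)))     # cosine similarity
        if sim > delta:
            scored.append((sim, candidate))
    scored.sort(reverse=True)
    return [candidate for _, candidate in scored[:n_candidates]]
```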

A POS check is then performed for each candidate word, because only replacement words with the same POS as the original preserve the grammatical characteristics of the text.
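One possible realization of this check, using the NLTK part-of-speech tagger (the paper does not specify a particular tagger), is sketched below.

```python
import nltk  # assumes the averaged_perceptron_tagger resource is installed

# Sketch of the POS check: a candidate replacement is kept only if it carries
# the same part-of-speech tag as the original word in its sentence context.
def has_same_pos(words, index, candidate):
    original_tag = nltk.pos_tag(words)[index][1]
    perturbed = words[:index] + [candidate] + words[index + 1:]
    candidate_tag = nltk.pos_tag(perturbed)[index][1]
    return original_tag == candidate_tag
```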

Using the remaining candidates, the SSC replaces the word in the sentence to create a candidate adversarial example, which is given to the model f to obtain its predicted score. Using a general-purpose sentence encoder, the semantic similarity is calculated as the cosine similarity between the high-dimensional sentence vectors of the original sample and the candidate adversarial example. If the semantic similarity is higher than the specified value, the candidate adversarial example is stored in the final candidate pool.
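A minimal sketch of the SSC is shown below; sentence_encoder is a placeholder for any general-purpose sentence encoder that maps a list of sentences to fixed-size vectors.

```python
import numpy as np

# Sketch of the semantic similarity check (SSC): embed the original and the
# perturbed sentence and keep the perturbation only if the cosine similarity
# of the two sentence vectors exceeds the threshold.
def passes_similarity_check(sentence_encoder, original, perturbed, threshold=0.7):
    a, b = sentence_encoder([original, perturbed])
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold
```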

After a variety of adversarial examples have been generated, the finalist with the highest similarity score is selected for output. In the absence of a finalist, the SE, POS, and SSC steps are repeated as described above for the next selected word.

The target model then undergoes additional training on the adversarial examples created in this manner.
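In one standard formulation (the symbols here are illustrative), this additional training step minimizes the classification loss on both the original sentences and their adversarial counterparts, each paired with the original label:

\[
\theta^{*} = \arg\min_{\theta} \sum_{(x,\,y)\in D} \Big[\, J\big(f_{\theta}(x),\, y\big) + J\big(f_{\theta}(x^{*}),\, y\big) \,\Big],
\]

where \(f_{\theta}\) is the target model with parameters \(\theta\), \(J\) is the classification loss, \(D\) is the training set, \(x\) is an original sentence with original label \(y\), and \(x^{*}\) is the adversarial example generated from \(x\).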

In the additional training process, the original label is assigned to the generated adversarial examples. By this process, the target model gains robustness against unknown adversarial examples. The details of this process are given in Algorithm 1.

4. Experimental Setup and Results

This section describes the experimental environment and shows the experimental results for the proposed method. In the experimental setup, the TensorFlow machine learning library [30] was used.

4.1. Dataset

The proposed method was verified using three text classification tasks and two textual entailment tasks. For the text classification tasks, three datasets were used: AG’s News (AG), a movie review dataset (MR), and the IMDB Large Movie Review Dataset (IMDB) [16]. AG is classified at the sentence level and consists of four categories: world, sports, business, and science/technology. It has 120,000 training samples and 7,600 test samples. The MR dataset is used to determine whether a sentence is positive or negative through classification at the sentence level. It has 9,595 training samples and 1,067 test samples. The IMDB dataset is a movie review dataset with document-level sentiment classification (positive or negative). It has 25,000 training samples and 25,000 test samples.

Two textual entailment datasets were used: the Stanford Natural Language Inference (SNLI) corpus [17] and the Multi-Genre Natural Language Inference (MultiNLI) corpus [18]. The SNLI corpus consists of approximately 560,000 sentence pairs written on the basis of image captions. Each pair is labeled according to the relationship between the two sentences: entailment, contradiction, or neutral. SNLI contains 550,152 training samples and 10,000 test samples. The MultiNLI corpus is a multigenre entailment dataset whose sentence pairs are drawn from sources such as transcribed speech, fiction, and government reports. It contains 392,702 training samples and 10,000 test samples.

4.2. Experimental Setup

BERT was used as the target model; it consisted of 12 layers with 768 hidden units and 12 attention heads and had 110 million parameters. The maximum number of position embeddings was 512, and the vocabulary size was 30,522. The intermediate size was 3,072, and GELU [31] was used as the hidden activation function. GELU is the Gaussian error linear unit, GELU(x) = xΦ(x), where Φ is the standard Gaussian cumulative distribution function. This nonlinear activation function tends to work better as the network gets deeper.
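For reference, a minimal sketch of this activation in TensorFlow is as follows; a tanh-based approximation of the same function is also commonly used in practice.

```python
import tensorflow as tf

# GELU(x) = x * Phi(x), where Phi is the standard Gaussian CDF,
# written here in its exact erf-based form.
def gelu(x):
    return 0.5 * x * (1.0 + tf.math.erf(x / tf.sqrt(2.0)))
```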

To generate a variety of adversarial examples using different parameter values, the similarity score threshold was set to 0.7, and the number of synonyms was set to values ranging from 20 to 50. The batch size was set to 32, and the maximum sequence length was set to 128. Adversarial examples were generated randomly for each dataset from the test data.

4.3. Experimental Results

As a measure, accuracy refers to the rate of agreement between the original class and the class predicted by the target model. Figure 2 shows examples of original sentences and adversarial sentences from the AG dataset. As shown in the figure, adversarial sentences were created by replacing important words in the original sentence with similar words while maintaining the meaning and grammar of the original as perceived by humans. Before being trained on the adversarial examples, the model classified the adversarial sentences incorrectly, but after being trained on the adversarial examples, it produced the same classification for the adversarial sentences as for the corresponding unmodified original sentences.

Figure 3 shows examples of original sentences and adversarial sentences from the MR dataset. The MR dataset is used to determine whether a movie review sentence is positive or negative. As shown in the figure, adversarial sentences were created by replacing important words in the original sentence with similar words while maintaining the original meaning and grammar. Before being trained on the adversarial examples, the model classified the adversarial sentences incorrectly, but after being trained on the adversarial examples, it produced the same classification for the adversarial sentences as for the corresponding unmodified original sentences.

Figure 4 shows examples of original sentences and adversarial sentences from the IMDB dataset. Unlike the other datasets, the IMDB dataset identifies positive and negative sentences at the document level. As shown in the figure, adversarial sentences were created by replacing important words in the original sentence with similar words while maintaining the original meaning and grammar. As with the other datasets, before being trained on the adversarial examples, the model classified the adversarial sentences incorrectly, but after being trained on the adversarial examples, it produced the same classification for the adversarial sentences as for the corresponding unmodified original sentences.

Figure 5 shows examples of original sentences and adversarial sentences from the SNLI dataset. The SNLI dataset indicates the type of relationship between the first sentence and the second sentence: entailment, contradiction, or neutral. As shown in the figure, adversarial sentences were created by replacing important words in the original sentence with similar words while maintaining the original meaning and grammar. As with the other datasets, before being trained on the adversarial examples, the model misrecognized the adversarial sentences, but after being trained on the adversarial examples, it classified the adversarial sentences into the same class as the corresponding original sentences.

Figure 6 shows examples of original sentences and adversarial sentences from the MultiNLI dataset. The MultiNLI dataset indicates the type of relationship between the first sentence and the second sentence: entailment, contradiction, or neutral. As shown in the figure, adversarial sentences were created by replacing important words in the original sentence with similar words while maintaining the original meaning and grammar. As with the SNLI dataset, before being trained on the adversarial examples, the model misrecognized the adversarial sentences, but after being trained on the adversarial examples, it classified the adversarial sentences into the same class as the corresponding original sentences.

Figure 7 shows the accuracy of the baseline model and the proposed model on the original sentences. “Baseline model” refers to a model without adversarial training, and “proposed model” refers to a model to which the proposed adversarial training method has been applied. As shown in the figure, the baseline model and the proposed model were very similar in their accuracy on the original sentences. The accuracy of the proposed model on the original sentences is slightly lower because the additional adversarial examples used in training influence the decision boundary for the recognition of the original sentences. However, because the adversarial examples were learned while the accuracy on the original sentences was maintained, the accuracy of the proposed model on the original sentences remained very close to that of the baseline model.

Figure 8 shows the accuracy of the baseline model and the proposed model on the adversarial sentences. As can be seen in the figure, the proposed model gives higher accuracy than the baseline model. For each dataset, 1,000 random adversarial examples were generated and used to additionally train the model. The proposed model has more than twice the accuracy of the baseline model on the adversarial sentences, except on the IMDB dataset, for which there was little relative improvement.

Figure 9 shows the word change rate for generating adversarial examples targeting the baseline model and the proposed model. The word change rate is the proportion of words in the entire sentence that are changed when the adversarial example is generated. Because the baseline model has not been trained on adversarial examples, it can be attacked with adversarial examples generated using a relatively small word change rate. On the AG dataset, there were cases for which this metric was higher for the baseline model than for the proposed model. Overall, however, the proposed model requires more word changes than the baseline model, and this may increase the likelihood that the changes will be noticed by humans. Therefore, the proposed model makes adversarial examples more difficult to generate by requiring a higher word change rate.

Figure 10 shows the number of queries to the baseline model and the proposed model needed to create the adversarial examples. Generating adversarial examples against a model that is robust to them requires more queries than attacking the baseline model. Because the model has been trained on adversarial examples, alternative candidate words that cause misrecognition must be found during generation, and because their success rate is low, more iterations are required. For this reason, the proposed model is more robust against adversarial examples than the baseline model.

Regarding the adversarial training of the target model, Figure 11 shows the training loss for each dataset. As shown in the figure, the loss value decreases as the number of steps increases, indicating that the parameters of the target model are being optimized on both the original samples and the adversarial examples. The target model that has completed adversarial training is robust to unknown textual adversarial examples.

We also analyzed the performance of the proposed method in terms of training time. We used a Tesla V100 GPU and 53.48 GB of RAM; the CPU was an Intel(R) Xeon(R) CPU @ 2.00 GHz with four cores. Tables 1 and 2 show, for each dataset, the number of training samples and the time taken by the proposed method to train the target model on the original samples and the adversarial samples. As the tables show, the required training time differs according to the length of the samples and the size of the dataset. With the adversarial training method, the required training time roughly doubles because the proposed method must train on both the original samples and the adversarial samples.

5. Discussion

This section discusses the assumptions, attack considerations, accuracy, datasets, applications, and limitations of the proposed method.

5.1. Assumption

The proposed method is designed to make a model robust against unknown adversarial examples by giving it additional training after generating adversarial examples. It is assumed that an attacker can generate adversarial examples because the confidence values for the input data are provided; in other words, the attacker is assumed to perform a white box attack on the target model in the sense of having access to these confidence values.

5.2. Attack Considerations

The results of the experiments with the proposed method show that the target model is robust to adversarial examples. The proposed model has good accuracy on adversarial examples compared with the baseline model, which has not been trained on them, and it was demonstrated that the proposed model can correctly recognize adversarial examples. Therefore, in terms of the attack success rate, even if the proposed model is targeted by a white box attack, the probability of attack failure is high.

In terms of the word change rate, the proposed model requires more word changes during the generation of adversarial examples than does the baseline model. This is because it is not easy to cause misrecognition by the proposed model by changing only a few important words. Therefore, it is more difficult for an attacker to generate adversarial examples for the proposed model than for the baseline model.

In terms of queries, the proposed model requires more feedback iterations than the baseline model. Because the proposed model is robust to adversarial examples, the attacker must generate adversarial examples with a variety of word combinations, requiring multiple instances of feedback from the proposed model. In summary, the proposed model requires more feedback iterations and more word changes, making the attack more difficult and resulting in an attack success rate lower than that for the baseline model.

5.3. Accuracy on the Original Sentences

In adversarial training, it is important that the target model’s accuracy on the original sentences is maintained even though the model is being trained on adversarial sentences. The proposed model was trained on adversarial examples without significant deterioration in its accuracy on the original sentences. With the additional adversarial training, the proposed model’s accuracy on the original sentences becomes slightly lower than the baseline model’s, but the difference is small.

5.4. Datasets

The performance of the proposed model differs across datasets. In terms of adversarial sentences, the effect of using the proposed method to defend against adversarial examples was substantial for the AG dataset but minimal for the IMDB dataset. This is because the IMDB dataset, which is classified at the document level, contains long samples with many words that can be attacked, whereas the AG dataset has four classes, in contrast to the other datasets, which have two or three classes. In general, however, for each dataset the proposed model requires more word changes and more iterations from the attacker and thus reduces the effectiveness of attacks while maintaining its accuracy on the original sentences. In terms of accuracy on the adversarial sentences, the proposed model provides an accuracy more than twice that of the baseline model.

5.5. Applications

In security [32–36], one area of application for the proposed method is in military scenarios. Even if an attacker supplies important documents containing adversarial sentences, applying the proposed method removes the risk of misrecognition and allows them to be recognized as the original sentences would be. This is important because, in a military scenario, the damage caused by the misrecognition of a secret document can be considerable. In addition, the proposed method can be used to build a safer text recognition model for use with medical data or in public-policy-based projects.

5.6. Limitations and Future Research

Limitations of the proposed method are that it requires a separate process for generating adversarial examples and training on them, and that a complete (100%) defense can be expected only against white box attacks. However, even in other environments the proposed method offers a partially effective defense, and other methods for generating and training with adversarial examples, such as generative adversarial nets, will be an interesting topic for future research.

In future work, a substitute model could be used when the target model is a black box that does not provide confidence values. In this approach, a substitute model similar to the target model is created by repeatedly querying the target model. Once the substitute model is created, a white box attack can be performed within the proposed method.

6. Conclusion

In this paper, we have proposed an adversarial training method for the BERT model as a defense against adversarial examples. In this study, adversarial examples were generated for five datasets and used to train the target model to give it robustness against adversarial examples. The experimental results show that, over the five datasets, the baseline model had an average accuracy of 88.1% on the original sentences and an average accuracy of 9.2% on the adversarial sentences, whereas the proposed model maintained an average accuracy of 87.2% on the original sentences and achieved an average accuracy of 22.5% on the adversarial sentences.

In future studies, the proposed method could be extended to other datasets and other attacks [37, 38]. In addition, the method could be modified to use generative adversarial nets [39] to generate the adversarial examples used to train target models. Finally, applying the proposed method as part of an ensemble defense in the text domain would be another interesting research topic; future work could investigate improved defenses that combine the proposed method with existing textual defenses, or defenses trained after generating a wider variety of adversarial examples.

Data Availability

The data used to support the findings of this study will be available from the corresponding author upon request after acceptance.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was supported by the AI R&D Center of Korea Military Academy, the Hwarang-Dae Research Institute of Korea Military Academy, and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1I1A1A01040308).